STAT 333 Course Note
Table of Contents
1. Fundamentals of Probability
1.1. What's Probability
1.1.1. Examples
Coin toss
Roll a die
every number in the set $\{1,2,3,4,5,6\}$
Tomorrow's weather
{sunny, rainy, cloudy,...}
Randomly pick a number in $[0, 1]$
Although things are random, they are not haphazard/arbitrary. There are "patterns"
Example 1
If we repeat tossing a coin, then the fraction of times that we get an "H" goes to $\frac{1}{2}$ as the number of tosses goes to infinity:
$$\frac{\#\text{ of "H"}}{\text{total }\#\text{ of tosses}} \rightarrow \frac{1}{2}$$
This number $1/2$ reflects how "likely" an "H" will appear in one toss (even if the experiment is not repeated).
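This long-run-frequency picture is easy to check by simulation; the sketch below (the seed and sample sizes are arbitrary choices) estimates the fraction of heads:

```python
import random

def head_fraction(n_tosses: int, seed: int = 0) -> float:
    """Toss a fair coin n_tosses times; return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The fraction drifts toward 1/2 as the number of tosses grows.
print(head_fraction(100))
print(head_fraction(100_000))
```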
1.2. Probability Models
The sample space $\Omega$ is the set consisting of all the possible outcomes of a random experiment.
1.2.1. Examples
$\{H, T\}$
$\{1,2,3,4,5,6\}$
$\{\text{sunny, rainy, cloudy}, \ldots\}$
$[0, 1]$
An event $E \subseteq \Omega$ is a subset of $\Omega$
for which we can talk about the "likelihood of happening"; for example,
in Example 2:
$\{\text{getting an even number}\} = \{2, 4, 6\}$
in Example 4:
$\{\text{the point is between }0\text{ and }1/3\} = [0, \frac{1}{3}]$ is an event
$\{\text{the point is rational}\} = \mathbb{Q} \cap [0, 1]$ is an event
We say an event $E$ "happens" if the result of the experiment turns out to belong to $E$ (a subset of $\Omega$).
A probability $P$ is a set function (a mapping from events to real numbers)
$$\begin{aligned}
P: \xi &\rightarrow \mathbb{R} \\
E &\mapsto P(E)
\end{aligned}$$
which satisfies the following 3 properties:
$\forall E \in \xi,\ 0 \leq P(E) \leq 1$
$P(\Omega) = 1$
For countably many disjoint events $E_1, E_2, \ldots$, we have $P\left(\bigcup_{i=1}^{\infty}E_i\right) = \sum_{i=1}^{\infty}P(E_i)$
(countable: $\exists$ a 1-1 mapping to the natural numbers $1, 2, 3, \ldots$)
Intuitively, one can think of the probability of an event as the "likelihood/chance" that the event happens. If we repeat the experiment a large number of times, the probability is the limiting fraction of times that the event happens:
$$P(E) = \lim_{n\rightarrow\infty} \frac{\#\text{ of times }E\text{ happens in }n\text{ trials}}{n}$$
1.2.1.1. Example 2
$$\begin{aligned}
&P(\{1\})=P(\{2\})=\ldots=P(\{6\})=\frac{1}{6} \\
&E = \{\text{even number}\} = \{2,4,6\} \\
\Rightarrow \;\; &P(E) = P(\{2\}) + P(\{4\}) + P(\{6\}) = \frac{1}{2}
\end{aligned}$$
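The die computation above can be replayed with exact arithmetic; a minimal sketch, summing atom probabilities over an event:

```python
from fractions import Fraction

# Uniform probability on the six faces of a fair die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event) -> Fraction:
    """P(E) = sum of the probabilities of the outcomes in E (additivity)."""
    return sum(pmf[face] for face in event)

even = {2, 4, 6}
print(prob(even))  # 1/2
```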
Properties of probability:
$P(E) + P(E^c) = 1$
$P(\emptyset)=0$
$E_1\subseteq E_2 \Rightarrow P(E_1)\leq P(E_2)$
$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$, where $P(E_1 \cap E_2)$ is the probability that $E_1$ and $E_2$ both happen
If the sample space $\Omega$ is discrete (it has at most countably many elements), then everything can be built from the "atoms":
$$\begin{aligned}
&\Omega = \{\omega_1, \omega_2, \ldots\} \\
&P(\{\omega_i\}) = p_i, \quad p_i \in [0, 1], \quad \sum_{i=1}^\infty p_i = 1
\end{aligned}$$
Then for any event $E=\{\omega_i, i \in I\}$, we have $P(E) = \sum_{i \in I}p_i$.
However, if the sample space $\Omega$ is continuous, e.g., $[0,1]$ in Example 4, then such a construction cannot be done: for any $x\in [0, 1]$ we get $P(\{x\}) = 0$ (where $\{x\}$ is the event that the point is exactly $x$).
We cannot get $P([0, \frac{1}{3}])$ by adding $P(\{x\})$ over $x \leq \frac{1}{3}$.
This is why we need the notion of event, and why we define $P$ as a set function from $\xi$ to $\mathbb{R}$ rather than a function from $\Omega$ to $\mathbb{R}$.
To summarize: a Probability Space consists of a triplet $(\Omega, \xi, P)$:
$\Omega$: sample space
$\xi$: collection of events
$P$: probability
1.3. Conditional Probability
If we know some information, the probability of an event can be updated.
Let $E$, $F$ be two events with $P(F) > 0$. The conditional probability of $E$ given $F$ is
$$P(E\mid F) = \frac{P(E \cap F)}{P(F)}$$
Again, think of probability as the long-run frequency:
$$\begin{aligned}
P(E \cap F) &= \lim_{n\rightarrow\infty}\frac{\#\text{ of times }E\text{ and }F\text{ happen in }n\text{ trials}}{n} \\
P(F) &= \lim_{n\rightarrow\infty}\frac{\#\text{ of times }F\text{ happens in }n\text{ trials}}{n} \\
\Rightarrow \frac{P(E\cap F)}{P(F)} &= \lim_{n\rightarrow\infty}\frac{\#\text{ of times }E\text{ and }F\text{ happen}}{\#\text{ of times }F\text{ happens}}
\end{aligned}$$
By definition
$$P(E\cap F) = P(E\mid F) \cdot P(F)$$
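The frequency-ratio interpretation can be checked by simulation; a sketch (the die events, seed, and sample size are arbitrary choices) with $E = \{\text{even}\}$ and $F = \{\text{face} \geq 4\}$, where the exact answer is $P(E\mid F) = \frac{2/6}{3/6} = \frac{2}{3}$:

```python
import random

def estimate_conditional(n: int, seed: int = 1) -> float:
    """Estimate P(E|F) as (# times E and F happen) / (# times F happens)
    for a fair die, with E = {even} and F = {face >= 4}."""
    rng = random.Random(seed)
    both = f_count = 0
    for _ in range(n):
        face = rng.randint(1, 6)
        if face >= 4:          # F happens
            f_count += 1
            if face % 2 == 0:  # E happens as well
                both += 1
    return both / f_count

print(estimate_conditional(100_000))  # close to 2/3
```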
1.4. Independence
Def: Two events $E$ and $F$ are said to be independent if $P(E\cap F)=P(E)\cdot P(F)$; denoted $E\perp\!\!\!\perp F$. This is different from being disjoint.
Assume $P(F)>0$; then $E\perp\!\!\!\perp F \Leftrightarrow P(E\mid F)=P(E)$; intuitively, knowing $F$ does not change the probability of $E$.
Proof :
$$\begin{aligned}
E\perp\!\!\!\perp F & \Leftrightarrow P(E\cap F) = P(E)\cdot P(F) \\
& \Leftrightarrow \frac{P(E\cap F)}{P(F)} = P(E) \\
& \Leftrightarrow P(E\mid F) = P(E)
\end{aligned}$$
More generally, a sequence of events $E_1, E_2, \ldots$ is called independent if for any finite index set $I$,
$$P\left(\bigcap_{i\in I}E_i\right)=\prod_{i\in I} P(E_i)$$
1.5. Bayes' rule and law of total probability
Theorem: Let $F_1, F_2,\ldots$ be disjoint events with $\bigcup_{i=1}^\infty F_i=\Omega$; we say $\{F_i\}_{i=1}^\infty$ forms a "partition" of the sample space $\Omega$.
Then $P(E)=\sum_{i=1}^\infty P(E\mid F_i)\cdot P(F_i)$.
Proof : Exercise
Intuition: Decompose the total probability into different cases.
$$P(E\cap F_2) = P(E\mid F_2)\cdot P(F_2)$$
1.5.1. Bayes' rule
$$P(F_i \mid E) = \frac{P(E\mid F_i)\cdot P(F_i)}{\sum_{j=1}^\infty P(E\mid F_j)\cdot P(F_j)}$$
Bayes' rule tells us how to find conditional probability by switching the role of the event and the condition.
Proof :
$$\begin{aligned}
P(F_i \mid E) & = \frac{P(F_i\cap E)}{P(E)} & \text{definition of conditional probability} \\
& = \frac{P(E\mid F_i)P(F_i)}{P(E)} \\
& = \frac{P(E\mid F_i)P(F_i)}{\sum_{j=1}^\infty P(E\mid F_j)P(F_j)} & \text{law of total probability}
\end{aligned}$$
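A worked numeric instance of the two results above (the two-set partition and all the probabilities here are made-up numbers for illustration):

```python
# Partition: F1 and F2 with P(F1) + P(F2) = 1; E is some event.
p_f = [0.01, 0.99]          # P(F1), P(F2)
p_e_given_f = [0.95, 0.05]  # P(E|F1), P(E|F2)

# Law of total probability: P(E) = sum_j P(E|Fj) * P(Fj)
p_e = sum(pe * pf for pe, pf in zip(p_e_given_f, p_f))

# Bayes' rule: P(F1|E) = P(E|F1) P(F1) / P(E)
posterior = p_e_given_f[0] * p_f[0] / p_e
print(round(posterior, 4))
```

Note how a small prior $P(F_1)$ keeps the posterior small even though $P(E\mid F_1)$ is large.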
2. Random variables and distributions
2.1. Random variables
$(\Omega,\xi, P)$: probability space.
Definition: A random variable $X$ (or r.v.) is a mapping from $\Omega$ to $\mathbb{R}$:
$$\begin{aligned}
X : \Omega &\rightarrow \mathbb{R} \\
\omega &\mapsto X(\omega)
\end{aligned}$$
A random variable transforms arbitrary "outcomes" into numbers.
$X$ induces a probability on $\mathbb{R}$. For $A\subseteq \mathbb{R}$, define
$$P(X\in A) := P(\{X(\omega)\in A\}) = P(\{\omega:X(\omega)\in A\}) = P(X^{-1}(A))$$
From now on, we can often "forget" the original probability space and focus on the random variables and their distributions.
Definition: Let $X$ be a random variable. The CDF (cumulative distribution function) $F$ of $X$ is defined by
$$F(x) = P(X\leq x) = P(X\in(-\infty, x])$$
where $X$ is a random variable and $x$ is a number.
Properties of cdf:
$F$ is non-decreasing: $F(x_1)\leq F(x_2)$ for $x_1 < x_2$
limits
$\lim_{x\rightarrow -\infty}F(x) = 0$
$\lim_{x\rightarrow\infty}F(x)=1$
$F(x)$ is right continuous: $\lim_{x\downarrow a}F(x) = F(a)$, where $x \downarrow a$ means $x$ decreases to $a$ (approaching from the right)
Hint: $\{X\leq a\}=\bigcap_{i=1}^\infty\{X\leq a_i\}$ for $a_i\downarrow a$
2.2. Discrete random variables and distributions
A random variable $X$ is called discrete if it only takes values in an at most countable set $\{x_1, x_2, \ldots\}$ (finite or countably infinite).
The distribution of a discrete random variable is fully characterized by its probability mass function (p.m.f.)
$$p(x):=P(X=x); \quad x=x_1, x_2, \ldots$$
Properties of pmf:
$p(x)\geq 0, \;\;\forall x$
$\sum_i p(x_i) = 1$
Q: what does the cdf of a discrete random variable look like?
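One way to explore this question: build $F(x) = \sum_{x_i \leq x} p(x_i)$ from a pmf and evaluate it at a few points (a sketch; the fair-die pmf is just an example):

```python
def make_cdf(pmf: dict):
    """Build F(x) = P(X <= x) from a discrete pmf.
    F is a right-continuous step function: it jumps by p(x_i) at each
    x_i and is flat in between."""
    points = sorted(pmf)
    def F(x: float) -> float:
        return sum(pmf[t] for t in points if t <= x)
    return F

F = make_cdf({x: 1/6 for x in range(1, 7)})
print(F(0.5), F(3), F(3.9), F(6))  # flat between the jump points
```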
2.2.1. Examples of discrete distributions
1. Bernoulli distribution
$$\begin{aligned}
p(1) &= P(X=1)=p \\
p(0) &= P(X=0) = 1-p \\
p(x) &= 0 \quad \text{otherwise}
\end{aligned}$$
Denote $X\sim Ber(p)$.
2. Binomial distribution
$$p(k) = P(X=k) = \binom{n}{k} p^k(1-p)^{n-k}, \quad k=0,1,\ldots,n$$
$X\sim Bin(n, p)$; $\binom{n}{k}$ counts the ways to choose which $k$ trials are the successes.
The binomial distribution is the distribution of the number of successes in $n$ independent trials, each having probability $p$ of success.
3. Geometric distribution
$$p(k) = P(X=k)=(1-p)^{k-1}p, \quad k=1,2,\ldots$$
where $(1-p)^{k-1}$: the first $k-1$ trials are all failures; $p$: success in the $k$-th trial.
$X\sim Geo(p)$
$X$ is the number of trials needed to get the first success in a sequence of independent trials, each with probability $p$ of success.
$X$ has the memoryless property:
$$P(X>n+m\mid X>m) = P(X>n), \qquad n, m=0,1,\ldots$$
Proof :
$$\begin{aligned}
P(X>k) & = \sum_{j=k+1}^\infty P(X=j) = \sum_{j=k+1}^\infty (1-p)^{j-1}p \\
& = (1-p)^k p \cdot \frac{1}{1-(1-p)} = (1-p)^k
\end{aligned}$$
$$\begin{aligned}
P(X>n+m\mid X>m) &= \frac{P(\{X>n+m\}\cap \{X>m\})}{P(X>m)} \\
&= \frac{P(X>n+m)}{P(X>m)} = \frac{(1-p)^{n+m}}{(1-p)^m} = (1-p)^n=P(X>n)
\end{aligned}$$
Intuition : The failures in the past have no influence on how long we still need to wait to get the first success in the future
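The tail formula $P(X>k)=(1-p)^k$ makes the memoryless property a one-line numeric check (a sketch; the particular $p$, $n$, $m$ are arbitrary):

```python
def geo_tail(k: int, p: float) -> float:
    """P(X > k) = (1-p)^k for X ~ Geo(p)."""
    return (1 - p) ** k

p, n, m = 0.3, 4, 7
# P(X > n+m | X > m) = P(X > n+m) / P(X > m)
lhs = geo_tail(n + m, p) / geo_tail(m, p)
rhs = geo_tail(n, p)
print(lhs, rhs)  # equal: past failures carry no information
```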
4. Poisson distribution
$$p(k)=P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}, \quad k=0,1,2,\ldots, \quad \lambda > 0$$
Other discrete distributions:
negative binomial
discrete uniform
2.3. Continuous random variables and distributions
Definition: A random variable $X$ is called continuous if there exists a non-negative function $f$ such that for any interval $[a,b]$, $(a,b)$, or $[a,b)$:
$$P(X\in[a,b]) = \int_a^b f(x)\,dx$$
The function $f$ is called the probability density function (pdf) of $X$.
Remark: the probability density function is not a probability; $P(X=x)=0$ for every $x$ if $X$ is continuous. The pdf $f$ only gives probabilities when it is integrated.
If $X$ is continuous, then we can get the cdf by
$$F(a)=P(X\in(-\infty, a])=\int^a_{-\infty} f(x)\,dx$$
Hence $F(x)$ is continuous, and differentiable "almost everywhere".
We can take $f(x)=F'(x)$ wherever the derivative exists, and let $f(x)$ be an arbitrary value otherwise; the value is often chosen to give $f$ some continuity.
Property of pdf:
$f(x)\geq 0$, $x\in \mathbb{R}$
$\int_{-\infty}^\infty f(x)\,dx = 1$
For $A\subseteq \mathbb{R}$, $P(X\in A)=\int_A f(x)\,dx$
2.3.1. Example of continuous distribution
Exponential distribution
$$f(x)= \begin{cases}
\lambda e^{-\lambda x}, & x\geq 0 \\
0, & x < 0
\end{cases}$$
$X\sim Exp(\lambda)$
Other continuous distributions:
Normal distribution
Uniform distribution
Exercises:
Find the cdf of $X\sim Exp(\lambda)$. For $k \geq 0$:
$$\begin{aligned}
F(k) = P(X\leq k)
&= \int_{-\infty}^k f(x)\,dx = \int_0^k \lambda e^{-\lambda x}\,dx \\
&= -e^{-\lambda x} \Big|^k_0 = -e^{-\lambda k} - (-e^0) = 1 - e^{-\lambda k}
\end{aligned}$$
(and $F(k)=0$ for $k<0$)
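The closed form $F(k)=1-e^{-\lambda k}$ can be double-checked by numerically integrating the density (a sketch using a midpoint Riemann sum; $\lambda$, $k$, and the grid size are arbitrary choices):

```python
import math

lam, k = 1.5, 2.0
n = 100_000
dx = k / n
# Midpoint Riemann sum for the integral of lambda * e^(-lambda x) on [0, k]
approx = sum(lam * math.exp(-lam * (i + 0.5) * dx) * dx for i in range(n))
exact = 1 - math.exp(-lam * k)
print(approx, exact)  # the two agree to high precision
```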
Show that the exponential distribution has the memoryless property: $P(X>t+s\mid X>t)=P(X>s)$.
2.4. Joint distribution of r.v's
Let $X$ and $Y$ be two r.v.'s defined on the same probability space $(\Omega, \xi, P)$.
For each $\omega \in \Omega$, we have at the same time $X(\omega)$ and $Y(\omega)$; then we can talk about the joint behavior of $X$ and $Y$.
The joint distribution of two r.v.'s is characterized by the joint cdf, the joint pmf (discrete case), or the joint pdf (continuous case).
Joint cdf:
$$F(x,y) = P(X\leq x, Y\leq y)$$
Joint pmf:
$$f(x,y)=P(X=x, Y=y)$$
Joint pdf: a function $f(x,y)$ such that for $a<b$, $c<d$,
$$P((X,Y)\in(a,b]\times(c,d]) = P(X\in(a,b],\ Y\in(c,d])=\int_a^b\int_c^d f(x,y)\,dy\,dx$$
Equivalently:
$$F(x,y)=\int_{-\infty}^x\int_{-\infty}^y f(s,t)\,dt\,ds \quad\text{and}\quad f(x,y)=\frac{\partial^2}{\partial x \partial y}F(x,y)$$
$P((X,Y)\in A) = \iint_A f(x,y)\,dx\,dy$ for $A\subseteq \mathbb{R}^2$
Definition: Two r.v.'s $X$ and $Y$ are called independent if for all sets $A,B\subseteq \mathbb{R}$,
$$P(X\in A, Y\in B)=P(X\in A)\cdot P(Y\in B)$$
(i.e., $\{X\in A\}$ and $\{Y\in B\}$ are independent events)
Theorem: For two r.v.'s $X$ and $Y$, the following are equivalent:
1. $X$ and $Y$ are independent;
2. $F(x,y)=F_X(x)F_Y(y)$ for all $x,y\in \mathbb{R}$, where $F_X$: cdf of $X$, $F_Y$: cdf of $Y$;
3. $f(x,y)=f_X(x)f_Y(y)$ for all $x,y\in \mathbb{R}$, where $f$ is the joint pmf/pdf of $X$ and $Y$, and $f_X$, $f_Y$ are the marginal pmfs/pdfs of $X$ and $Y$, respectively.
Proof :
1. $\Rightarrow$ 2.
If $X \perp\!\!\!\perp Y$, then by definition,
$$F(x,y)=P(X\in(-\infty, x], Y\in(-\infty, y]) = P(X\in(-\infty, x])\cdot P(Y\in(-\infty,y]) = F_X(x)F_Y(y)$$
2. $\Rightarrow$ 3.
Assume $F(x,y)=F_X(x)\cdot F_Y(y)$; then
$$\begin{aligned}
f(x,y)=\frac{\partial^2}{\partial x \partial y}F(x,y)
& = \frac{\partial^2}{\partial x \partial y} F_X(x)F_Y(y) \\
&= \left(\frac{\partial}{\partial x}F_X(x)\right)\left(\frac{\partial}{\partial y}F_Y(y)\right) = f_X(x)f_Y(y)
\end{aligned}$$
3. $\Rightarrow$ 1.
Assume $f(x,y)=f_X(x)f_Y(y)$; for $A,B\subseteq \mathbb{R}$,
$$\begin{aligned}
P(X\in A, Y\in B) &= \int_{y\in B}\int_{x\in A} f(x,y)\,dx\,dy = \int_{y\in B}\int_{x\in A} f_X(x)f_Y(y)\,dx\,dy \\
&= \left(\int_{x\in A} f_X(x)\,dx\right) \left(\int_{y\in B} f_Y(y)\,dy\right) = P(X\in A)P(Y\in B)
\end{aligned}$$
2.5. Expectation
Definition: For a r.v. $X$, the expectation of $g(X)$ is defined as
$$\mathbb{E}(g(X))= \begin{cases}
\sum_{i=1}^\infty g(x_i)P(X=x_i) & \text{for discrete } X \\
\int_{-\infty}^\infty g(x)f(x)\,dx & \text{for continuous } X
\end{cases}$$
Let $X$, $Y$ be two r.v.'s; the expectation of $g(X,Y)$ is defined in a similar way:
$$\mathbb{E}(g(X,Y)) = \begin{cases}
\sum_i\sum_j g(x_i,y_j)P(X=x_i, Y=y_j) & \text{discrete case}\\
\iint g(x, y)f(x,y)\,dx\,dy & \text{continuous case}
\end{cases}$$
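The discrete-case formula is a direct weighted sum; for instance, for one fair-die roll (exact arithmetic used so the answers come out as fractions):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g, pmf) -> Fraction:
    """E(g(X)) = sum_i g(x_i) P(X = x_i) for a discrete X."""
    return sum(g(x) * p for x, p in pmf.items())

print(expectation(lambda x: x, pmf))      # E(X)   = 7/2
print(expectation(lambda x: x * x, pmf))  # E(X^2) = 91/6
```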
2.5.1. Properties of expectation
Taking $g(X)=X$ gives the expectation of $X$:
$$\mathbb{E}(X)= \begin{cases}
\sum_i x_i P(X=x_i) \\
\int_{-\infty}^{\infty} xf(x)\,dx
\end{cases}$$
Linearity:
$\mathbb{E}(aX+b)=a\mathbb{E}(X)+b$
$\mathbb{E}(X+Y)=\mathbb{E}(X)+\mathbb{E}(Y)$
If $X\perp\!\!\!\perp Y$, then $\mathbb{E}(g(X)h(Y))=\mathbb{E}(g(X))\cdot \mathbb{E}(h(Y))$
Proof (continuous case):
$$\begin{aligned}
\mathbb{E}(g(X)h(Y))
&= \int_{-\infty}^\infty\int_{-\infty}^\infty g(x)h(y)f(x,y)\,dx\,dy \\
&= \int_{-\infty}^\infty\int_{-\infty}^\infty g(x)h(y)f_X(x)f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^\infty g(x)f_X(x)\,dx\cdot\int_{-\infty}^\infty h(y)f_Y(y)\,dy = \mathbb{E}(g(X))\,\mathbb{E}(h(Y))
\end{aligned}$$
In particular, $\mathbb{E}(XY)=\mathbb{E}(X)\mathbb{E}(Y)$ if $X\perp\!\!\!\perp Y$.
2.5.2. Definitions
Definition: The expectation $\mathbb{E}(X^n)$ is called the $n$-th moment of $X$:
1st moment: $\mathbb{E}(X)$
2nd moment: $\mathbb{E}(X^2)$
Definition: The variance of a r.v. $X$ is defined as
$$Var(X)=\mathbb{E}\left((X-\mathbb{E}(X))^2\right), \text{ also denoted } \sigma^2 \text{ or } \sigma_X^2$$
Definition: The covariance of the r.v.'s $X$ and $Y$ is defined as
$$Cov(X,Y)=\mathbb{E}[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))]$$
Thus $Var(X)=Cov(X,X)$.
Definition: The correlation between $X$ and $Y$ is defined as
$$Cor(X,Y)=\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$$
Fact: $Var(X)=\mathbb{E}(X^2)-(\mathbb{E}(X))^2$
Proof :
$$\begin{aligned}
Var(X) &= \mathbb{E}((X-\mathbb{E}(X))^2) \\
&= \mathbb{E}(X^2-2X\mathbb{E}(X)+(\mathbb{E}(X))^2) \\
&= \mathbb{E}(X^2)-2\mathbb{E}(X\mathbb{E}(X))+(\mathbb{E}(X))^2 \\
&= \mathbb{E}(X^2)-2(\mathbb{E}(X))^2+(\mathbb{E}(X))^2 \\
&= \mathbb{E}(X^2)-(\mathbb{E}(X))^2 \quad\quad\blacksquare
\end{aligned}$$
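The two expressions for the variance agree numerically; a quick check for one fair-die roll, with exact arithmetic:

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
EX = sum(x * p for x, p in pmf.items())        # E(X)   = 7/2
EX2 = sum(x * x * p for x, p in pmf.items())   # E(X^2) = 91/6

# Definition: Var(X) = E((X - E(X))^2)
var_def = sum((x - EX) ** 2 * p for x, p in pmf.items())
# Shortcut: Var(X) = E(X^2) - (E(X))^2
var_fact = EX2 - EX ** 2
print(var_def, var_fact)  # both 35/12
```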
Fact: $Cov(X,Y)=\mathbb{E}(XY)-\mathbb{E}(X)\mathbb{E}(Y)$
Proof :
$$\begin{aligned}
Cov(X,Y)
&=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \\
&=\mathbb{E}[XY - X\mathbb{E}[Y]-Y\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y]] \\
&=\mathbb{E}[XY] - \mathbb{E}[X\mathbb{E}[Y]] - \mathbb{E}[Y\mathbb{E}[X]] + \mathbb{E}[X]\mathbb{E}[Y] \\
&=\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[Y]\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y] \\
&=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y] \quad\quad\blacksquare
\end{aligned}$$
Variance and covariance are translation invariant; variance is quadratic and covariance is bilinear:
$$Var(aX+b)=a^2\cdot Var(X)$$
$$Cov(aX+b, cY+d)=ac\cdot Cov(X,Y)$$
Proof: $Var(aX+b)=a^2\cdot Var(X)$
$$\begin{aligned}
Var(aX+b) &= \mathbb{E}((aX+b)^2)-(\mathbb{E}(aX+b))^2 \\
&= \mathbb{E}(a^2X^2 + 2abX + b^2) - (a\mathbb{E}(X)+b)^2 \\
&= a^2\mathbb{E}(X^2) + 2ab\mathbb{E}(X)+b^2 - a^2\mathbb{E}^2(X) - 2ab\mathbb{E}(X) - b^2 \\
&= a^2\mathbb{E}(X^2)-a^2\mathbb{E}^2(X) = a^2 Var(X) \quad\quad\blacksquare
\end{aligned}$$
Proof: $Var(X+Y) = Var(X)+Var(Y)+2Cov(X,Y)$
$$\begin{aligned}
Var(X+Y) &= \mathbb{E}[(X+Y)^2] - \mathbb{E}^2[X+Y] \\
&=\mathbb{E}[X^2 + 2XY + Y^2] - (\mathbb{E}[X]+\mathbb{E}[Y])^2 \\
&=\mathbb{E}[X^2] + 2\mathbb{E}[XY] + \mathbb{E}[Y^2] - \mathbb{E}^2[X] - 2\mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}^2[Y] \\
&=(\mathbb{E}[X^2]-\mathbb{E}^2[X]) + (\mathbb{E}[Y^2]-\mathbb{E}^2[Y]) + 2(\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]) \\
&=Var(X) + Var(Y) + 2 Cov(X,Y) \quad\quad\blacksquare
\end{aligned}$$
If $X\perp\!\!\!\perp Y$, then $Cov(X,Y)=0$ and $Var(X+Y)=Var(X)+Var(Y)$.
Proof :
$Cov(X,Y)=\mathbb{E}(XY)-\mathbb{E}(X)\mathbb{E}(Y)$, and we know $X\perp\!\!\!\perp Y \Rightarrow \mathbb{E}(XY)=\mathbb{E}(X)\mathbb{E}(Y)$. Thus $Cov(X,Y)=0$, and
$$Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)=Var(X)+Var(Y)$$
So independence implies the covariance is 0 ("uncorrelated"); the converse is not true:
$$Cov(X,Y)=0 \;\;\cancel{\Rightarrow}\;\; \text{independence}$$
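A standard counterexample for the failed converse: take $X$ uniform on $\{-1,0,1\}$ and $Y=X^2$; then $Cov(X,Y)=0$ even though $Y$ is a function of $X$, so the two are certainly not independent. A sketch with exact arithmetic:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is fully determined by X.
support = [-1, 0, 1]
p = Fraction(1, 3)

EX  = sum(x * p for x in support)            # E(X)  = 0
EY  = sum(x * x * p for x in support)        # E(Y)  = E(X^2) = 2/3
EXY = sum(x * (x * x) * p for x in support)  # E(XY) = E(X^3) = 0

cov = EXY - EX * EY
print(cov)  # 0, yet P(X=0, Y=0) = 1/3 != P(X=0)P(Y=0) = 1/9
```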
Remarks
We have $\mathbb{E}(X+Y)=\mathbb{E}(X)+\mathbb{E}(Y)$.
If $X\perp\!\!\!\perp Y$, we also have:
$\mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y)$, and
$Var(X+Y)=Var(X)+Var(Y)$
It's important to remember that the first result and the other two are of a very different nature. $\mathbb{E}(X+Y)=\mathbb{E}(X)+\mathbb{E}(Y)$ is a property of expectation and holds unconditionally; the other two, $\mathbb{E}(XY)=\mathbb{E}(X)\mathbb{E}(Y)$ and $Var(X+Y)=Var(X)+Var(Y)$, hold only if $X\perp\!\!\!\perp Y$.
It is more appropriate to consider them as properties of independence rather than properties of expectation and variance.
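A quick simulation makes the contrast concrete. Below is a minimal sketch (the choice $Y=X$, a deliberately dependent pair, is an assumption for illustration): expectation stays additive under dependence, variance does not.

```python
import random

# Sketch: X uniform on {0,...,9}; take Y = X, a deliberately dependent pair.
# E(X+Y) = E(X) + E(Y) holds regardless of dependence;
# Var(X+Y) = Var(X) + Var(Y) fails because Cov(X, Y) > 0 here.
random.seed(0)
xs = [random.randrange(10) for _ in range(100_000)]
ys = xs  # Y = X

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

s = [x + y for x, y in zip(xs, ys)]
print(abs(mean(s) - (mean(xs) + mean(ys))))  # ~0: expectation is additive
print(var(s), var(xs) + var(ys))             # ~4*Var(X) vs 2*Var(X): not equal
```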
2.6. Indicator
A random variable $I$ is called an indicator if
I(\omega)= \begin{cases}
1 & \omega\in A \\
0 & \omega\notin A
\end{cases}
for some event $A$. For $A$ given, $I$ is also denoted $I_A$.
The most important property of an indicator is that its expectation gives the probability of the event: $\mathbb{E}(I_A)=\mathbb{P}(A)$
Proof :
\begin{aligned}
\mathbb{P}(I_A=1)
&= \mathbb{P}(\{\omega: I_A(\omega)=1\}) \\
&= \mathbb{P}(\{\omega: \omega\in A\}) \\
&= \mathbb{P}(A)
\end{aligned}

\mathbb{P}(I_A=0) = 1-\mathbb{P}(A)
\Rightarrow \mathbb{E}(I_A)=1\cdot \mathbb{P}(A)+0\cdot (1-\mathbb{P}(A)) = \mathbb{P}(A)
2.6.1. Example
We see $I_A\sim Ber(\mathbb{P}(A))$.
Let $X\sim Bin(n,p)$: $X$ is the number of successes in $n$ Bernoulli trials, each with probability $p$ of success.
\Rightarrow X=I_1+\cdots+I_n
where $I_1,\cdots,I_n$ are indicators for independent events: $I_i=1$ if the $i$th trial is a success, $I_i=0$ if the $i$th trial is a failure.
Hence the $I_i$ are iid (independent and identically distributed) r.v.'s.
\begin{aligned}
\Rightarrow \mathbb{E}(X)
&= \mathbb{E}(I_1+\cdots+I_n) \\
&= \mathbb{E}(I_1)+\cdots+\mathbb{E}(I_n) \\
&= p + \cdots + p = n\cdot p
\end{aligned}
\begin{aligned}
Var(X)
&= Var(I_1+\cdots+I_n) \\
&= Var(I_1)+\cdots+Var(I_n) \\
&= n\cdot Var(I_i) \\
&= n\cdot p(1-p)
\end{aligned}
Var(I_1)=\mathbb{E}(I_1^2)-(\mathbb{E}(I_1))^2 = \mathbb{E}(I_1)-(\mathbb{E}(I_1))^2=p-p^2=p(1-p)
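The indicator decomposition above is easy to check numerically. A minimal sketch ($n=20$, $p=0.3$ are assumed values): simulate $X$ as a sum of $n$ independent Bernoulli indicators and compare the sample mean and variance with $np$ and $np(1-p)$.

```python
import random

# Sketch: Bin(n, p) as a sum of n independent indicators I_i ~ Ber(p).
# The sample mean should approach n*p, the sample variance n*p*(1-p).
random.seed(1)
n, p, reps = 20, 0.3, 200_000
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(reps)]
m = sum(samples) / reps
v = sum((x - m) ** 2 for x in samples) / reps
print(m, n * p)            # sample mean vs n*p = 6.0
print(v, n * p * (1 - p))  # sample variance vs n*p*(1-p) = 4.2
```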
2.6.2. Example 3
Let $X$ be a r.v. taking values in the non-negative integers. Then
\mathbb{E}(X) = \sum_{n=0}^\infty P(X>n)
Proof :
Note that $X=\sum_{n=0}^\infty I_n$ where $I_n = I_{X>n}$. ($X>n$ is an event)
\begin{aligned}
\mathbb{E}(X) &= \mathbb{E}\Big(\sum_{n=0}^\infty I_n\Big) \\
&= \sum_{n=0}^\infty \mathbb{E}(I_n) \\
&= \sum_{n=0}^\infty P(X>n)
\end{aligned}
In particular, let $X\sim Geo(p)$. As we have seen, $P(X>n)=(1-p)^n$, so
\begin{aligned}
\mathbb{E}(X) &= \sum_{n=0}^\infty P(X>n) \\
&= \sum_{n=0}^\infty (1-p)^n \\
&= \frac{1}{1-(1-p)} = \frac{1}{p}
\end{aligned}
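The tail-sum formula and the closed form $1/p$ can be checked numerically; a small sketch with an assumed $p=0.25$, truncating the infinite sums at a point where the remaining terms are negligible:

```python
# Sketch: for X ~ Geo(p) (trials until first success, support {1, 2, ...}),
# compare E(X) computed directly, via the tail-sum formula, and as 1/p.
p = 0.25
K = 10_000  # truncation point; remaining terms are negligible for this p

# Direct expectation: sum_k k * P(X = k) with P(X = k) = p * (1-p)^(k-1)
direct = sum(k * p * (1 - p) ** (k - 1) for k in range(1, K))

# Tail-sum formula: sum_n P(X > n) with P(X > n) = (1-p)^n
tail = sum((1 - p) ** n for n in range(K))

print(direct, tail, 1 / p)  # all three ~4.0
```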
2.7. Moment generating function
Definition : Let $X$ be a r.v. Then the function $M(t)=\mathbb{E}(e^{tX})$ is called the moment generating function (mgf) of $X$, if the expectation exists for all $t\in(-h,h)$ for some $h>0$.
Remark : The mgf is not always well-defined. It is important to check the existence of the expectation.
2.7.1. Properties of mgf
Moment Generating Function generates moments
Theorem :
$M(0) = 1$
$M^{(k)}(0) = \mathbb{E}(X^k)$, $k=1,2,\ldots$ (where $M^{(k)}(0)=\frac{d^k}{dt^k}M(t)\big|_{t=0}$)
As a result, we have $M(t)=\sum_{k=0}^\infty \frac{M^{(k)}(0)}{k!}t^k = \sum_{k=0}^\infty \frac{\mathbb{E}(X^k)}{k!}t^k$ (a method to get the moments of a r.v.)
Let $X\perp\!\!\!\perp Y$, with mgf's $M_X, M_Y$, and let $M_{X+Y}$ be the mgf of $X+Y$. Then
M_{X+Y}(t)=M_X(t)M_Y(t)
\begin{aligned}
M_{X+Y}(t)
&= \mathbb{E}(e^{t(X+Y)}) \\
&= \mathbb{E}(e^{tX}e^{tY}) \\
&= \mathbb{E}(e^{tX})\mathbb{E}(e^{tY}) \\
&= M_X(t)M_Y(t)
\end{aligned}
The mgf completely determines the distribution of a r.v.
If $M_X(t)=M_Y(t)$ for all $t\in(-h,h)$ for some $h>0$, then $X\stackrel{d}{=}Y$. ($\stackrel{d}{=}$: have the same distribution)
Example: Let $X\sim Poi(\lambda_1)$, $Y\sim Poi(\lambda_2)$, $X\perp\!\!\!\perp Y$. Find the distribution of $X+Y$.
First, derive the mgf of a Poisson distribution.
\begin{aligned}
M_X(t)
&= \mathbb{E}(e^{tX}) \\
&= \sum_{n=0}^\infty e^{tn}\cdot P(X=n) \\
&= \sum_{n=0}^\infty e^{tn}\cdot \frac{\lambda_1^n}{n!}e^{-\lambda_1} \\
&= \sum_{n=0}^\infty \frac{(e^t\lambda_1)^n}{n!}\cdot e^{-\lambda_1}
\end{aligned}
We know that $\sum_{n=0}^\infty \frac{(e^t\lambda_1)^n}{n!}=e^{e^t\lambda_1}$ (since $\frac{(e^t\lambda_1)^n}{n!}e^{-e^t\lambda_1}$ is the pmf of $Poi(e^t\lambda_1)$).
\Rightarrow M_X(t) = e^{e^t\lambda_1}e^{-\lambda_1}=e^{\lambda_1(e^t-1)}, \quad t\in\mathbb{R}
($e^{\lambda_1(e^t-1)}$ is the mgf of $Poi(\lambda_1)$.) Similarly, $M_Y(t)=e^{\lambda_2(e^t-1)}$.
We know that
\begin{aligned}
M_{X+Y}(t)
&= M_X(t)M_Y(t) \\
&= e^{\lambda_1(e^t-1)}e^{\lambda_2(e^t-1)} \\
&= e^{(\lambda_1+\lambda_2)(e^t-1)}
\end{aligned}
This is the mgf of P o i ( λ 1 + λ 2 ) Poi(\lambda_1+\lambda_2) P o i ( λ 1 + λ 2 ) !
Since the mgf uniquely determines the distribution, $X+Y\sim Poi(\lambda_1+\lambda_2)$.
In general, if $X_1, X_2, \ldots, X_n$ are independent and $X_i\sim Poi(\lambda_i)$, then $\sum X_i \sim Poi(\sum\lambda_i)$.
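A simulation is a reasonable cross-check of this mgf argument. The sketch below samples the two Poissons with Knuth's classical multiplication method (an implementation choice, not from the notes; $\lambda_1=1.5$, $\lambda_2=2.5$ are assumed values) and compares the empirical pmf of $X+Y$ with the $Poi(\lambda_1+\lambda_2)$ pmf.

```python
import math
import random

# Sketch: X ~ Poi(1.5), Y ~ Poi(2.5) independent; the pmf of X + Y should
# match Poi(4), as the mgf factorization predicts.
random.seed(2)

def poisson(lam):
    # Knuth's method: count uniform draws until their product drops below e^{-lam}
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

reps, lam = 100_000, 4.0
counts = {}
for _ in range(reps):
    total = poisson(1.5) + poisson(2.5)
    counts[total] = counts.get(total, 0) + 1

for j in range(8):
    empirical = counts.get(j, 0) / reps
    exact = math.exp(-lam) * lam ** j / math.factorial(j)
    print(j, round(empirical, 4), round(exact, 4))  # the two columns agree closely
```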
2.7.2. Joint mgf
Definition : Let $X, Y$ be r.v.'s. Then $M(t_1, t_2):=\mathbb{E}(e^{t_1X+t_2Y})$ is called the joint mgf of $X$ and $Y$, if the expectation exists for all $t_1\in(-h_1, h_1)$, $t_2\in(-h_2, h_2)$ for some $h_1, h_2>0$.
More generally, we can define $M(t_1,\ldots, t_n)=\mathbb{E}(\exp(\sum_{i=1}^n t_iX_i))$ for r.v.'s $X_1,\cdots,X_n$, if the expectation exists for $\{(t_1,\cdots,t_n): t_i\in(-h_i,h_i), i=1,\cdots,n\}$ for some $h_i>0$, $i=1,\cdots,n$.
2.7.2.1. Properties of the joint mgf
\begin{aligned}
M_X(t)
&= \mathbb{E}(e^{tX}) \\
&= \mathbb{E}(e^{tX+0Y}) \\
&= M(t,0) \\
M_Y(t)
&= M(0, t)
\end{aligned}
\frac{\partial^{m+n}}{\partial t_1^m \partial t_2^n} M(t_1, t_2)\Big|_{(0,0)} = \mathbb{E}(X^mY^n)
(the proof is similar to the single-r.v. case)
If $X\perp\!\!\!\perp Y$, then $M(t_1, t_2)=M_X(t_1)M_Y(t_2)$.
Proof :
\begin{aligned}
M(t_1, t_2) &= \mathbb{E}(e^{t_1X+t_2Y}) \\
(X\perp\!\!\!\perp Y)\quad &= \mathbb{E}(e^{t_1X}e^{t_2Y}) \\
&= \mathbb{E}(e^{t_1X})\cdot \mathbb{E}(e^{t_2Y}) \\
&= M_X(t_1)\cdot M_Y(t_2)
\end{aligned}
Remark : Don't confuse this with the result $X\perp\!\!\!\perp Y\Rightarrow M_{X+Y}(t)=M_X(t)M_Y(t)$.
$M_{X+Y}(t)$: mgf of $X+Y$; a function of a single argument $t$
$M(t_1,t_2)$: joint mgf of $(X,Y)$; two arguments $t_1, t_2$
3. Conditional distribution and conditional expectation
3.1. Conditional distribution
3.1.1. Discrete case
Definition : Let $X$ and $Y$ be discrete r.v.'s. The conditional distribution of $X$ given $Y$ is given by:
P(X=x | Y=y) = \frac{P(X=x, Y=y)}{P(Y=y)}
$P(X=x|Y=y)$ is also written $f_{X|Y=y}(x)$ or $f_{X|Y}(x|y)$ (the conditional probability mass function).
A conditional pmf is a legitimate pmf: given any $y$, $f_{X|Y=y}(x)\geq 0$ for all $x$, and
\sum_x f_{X|Y=y}(x) = 1
Note that given Y = y Y=y Y = y , as x x x changes, the value of the function f X ∣ Y = y ( x ) f_{X|Y=y}(x) f X ∣ Y = y ( x ) is proportional to the joint probability.
f_{X|Y=y}(x) \propto P(X=x, Y=y)
This is useful for solving problems where the denominator P ( Y = y ) P(Y=y) P ( Y = y ) is hard to find.
3.1.1.1. Example
$X_1\sim Poi(\lambda_1)$, $X_2\sim Poi(\lambda_2)$, $X_1\perp\!\!\!\perp X_2$, $Y=X_1+X_2$.
Q: $P(X_1=k|Y=n)$?
Note $P(X_1=k|Y=n) = f_{X_1|Y=n}(k)$.
A:
$P(X_1=k|Y=n)$ can only be non-zero for $k=0,\cdots,n$. In this case,
\begin{aligned}
P(X_1=k|Y=n) &= \frac{P(X_1=k, Y=n)}{P(Y=n)} \\
& \propto P(X_1=k, Y=n) \\
& = P(X_1=k, X_2=n-k) \\
& = e^{-\lambda_1}\frac{\lambda_1^k}{k!}\cdot e^{-\lambda_2}\frac{\lambda_2^{n-k}}{(n-k)!} \\
& \propto \frac{(\lambda_1/\lambda_2)^k}{k!(n-k)!}
\end{aligned}
We can get $P(X_1=k|Y=n)$ by normalizing the above expression:
P(X_1=k|Y=n) = \frac{(\lambda_1/\lambda_2)^k / k!(n-k)!}{\sum_{k=0}^n(\lambda_1/\lambda_2)^k/k!(n-k)!}
but then we would need to find $\sum_{k=0}^n(\lambda_1/\lambda_2)^k/k!(n-k)!$.
An easier way is to compare the expression with the known results for common distributions. In particular, if $X\sim Bin(n, p)$,
\begin{aligned}
P(X=k) &= \binom{n}{k}p^k(1-p)^{n-k} \\
&\propto \frac{(p/(1-p))^k}{k!(n-k)!}
\end{aligned}
$\Rightarrow P(X_1=k|Y=n)$ follows a binomial distribution with parameters $n$ and $p$ given by $\frac{p}{1-p}=\frac{\lambda_1}{\lambda_2} \Rightarrow p=\frac{\lambda_1}{\lambda_1+\lambda_2}$.
Thus, given $Y=X_1+X_2=n$, the conditional distribution of $X_1$ is binomial with parameters $n$ and $\frac{\lambda_1}{\lambda_1+\lambda_2}$.
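This "Poisson thinning" result can be checked by brute force: simulate the pair, keep only the runs where $Y=n$, and tabulate $X_1$. The sketch below uses assumed values $\lambda_1=2$, $\lambda_2=3$, $n=5$ and Knuth's Poisson sampler (an implementation choice, not from the notes).

```python
import math
import random

# Sketch: X1 ~ Poi(2), X2 ~ Poi(3) independent, Y = X1 + X2.
# Conditional on Y = 5, X1 should look Bin(5, 2/5).
random.seed(3)
lam1, lam2, n = 2.0, 3.0, 5
p = lam1 / (lam1 + lam2)

def poisson(lam):
    # Knuth's method
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

hits, counts = 0, [0] * (n + 1)
for _ in range(300_000):
    x1, x2 = poisson(lam1), poisson(lam2)
    if x1 + x2 == n:  # condition on the event Y = n
        hits += 1
        counts[x1] += 1

for k in range(n + 1):
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    print(k, round(counts[k] / hits, 3), round(binom, 3))  # empirical vs Bin(n, p)
```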
3.1.2. Continuous case
Definition : Let $X$ and $Y$ be continuous r.v.'s. The conditional distribution of $X$ given $Y$ is given by
f_{X|Y}(x|y) = f_{X|Y=y}(x) = \frac{f(x,y)}{f_Y(y)}
A conditional pdf is a legitimate pdf:
\begin{aligned}
f_{X|Y}(x|y) &\geq 0, & x,y\in \mathbb{R} \\
\int_{-\infty}^\infty f_{X|Y}(x|y) dx &= 1, & y\in \mathbb{R}
\end{aligned}
3.1.2.1. Example
Suppose $X\sim Exp(\lambda)$ and $Y|X=x\sim Exp(x)$, i.e. $f_{Y|X}(y|x)=xe^{-xy}$, $y>0$ ($\leftarrow$ the conditional distribution of $Y$ given $X=x$).
Q: Find the conditional pdf $f_{X|Y}(x|y)$.
A:
\begin{aligned}
f_{X|Y}(x|y) &= \frac{f(x,y)}{f_Y(y)} \\
& \propto f(x,y) \\
& = f_{Y|X}(y|x)\cdot f_X(x) \\
& = xe^{-xy}\lambda e^{-\lambda x} \\
& \propto xe^{-x(y+\lambda)}, \quad\quad x>0, y>0
\end{aligned}
Normalization (make the total probability 1):
\begin{aligned}
f_{X|Y}(x|y) & = \frac{xe^{-x(y+\lambda)}}{\int_0^\infty xe^{-x(y+\lambda)}dx} \\
\int_0^\infty xe^{-x(y+\lambda)}dx &= \left(\frac{1}{\lambda + y}\right)^2 \quad \leftarrow \text{integration by parts}
\end{aligned}
Thus, $f_{X|Y}(x|y) = (\lambda + y)^2xe^{-x(y+\lambda)}$, $x>0$.
This is a gamma distribution with shape parameter $2$ and rate parameter $\lambda + y$.
3.1.2.1. Example 2
Continuing the previous example, find the distribution of $Z=XY$.
Attention : the following method is wrong:
f_Z(z) = \int_0^\infty f_{Y|X}\Big(\frac{z}{x}\Big|x\Big)\cdot f_X(x)dx
If we want to work directly with pdf's, we need the change-of-variable formula for multiple variables. The right formula turns out to be
\begin{aligned}
f_Z(z) & = \int_0^\infty f_{X,Z}(x,z)dx = \int_0^\infty f_{Z|X}(z|x) f_X(x)dx \\
&= \int_0^{\infty} f\Big(x, \frac{z}{x}\Big)\cdot \frac{1}{x}dx \\
&= \int_0^\infty f_{Y|X}\Big(\frac{z}{x}\Big|x\Big) f_X(x)\cdot\frac{1}{x}dx
\end{aligned}
An easier way is to use the cdf, which gives a probability rather than a density:
\begin{aligned}
P(Z\leq z) & = P(XY\leq z) \\
& = \int_0^\infty P(XY\leq z|X=x) f_X(x) dx \quad\quad (\text{law of total probability}) \\
&= \int_0^\infty P\Big(Y\leq\frac{z}{x}\Big|X=x\Big)\cdot f_X(x)dx \quad\quad (Y|X=x \sim Exp(x)) \\
&= \int_0^\infty(1-e^{-x\cdot\frac{z}{x}})\cdot\lambda e^{-\lambda x} dx \\
&= 1 - e^{-z}\int_0^\infty \lambda e^{-\lambda x}dx \\
&= 1-e^{-z} \\
\Rightarrow\ & Z\sim Exp(1)
\end{aligned}
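The $Exp(1)$ answer is easy to check by simulation; a minimal sketch with an assumed $\lambda=2$ (the result should not depend on $\lambda$):

```python
import math
import random

# Sketch: X ~ Exp(lam), Y | X = x ~ Exp(x), Z = XY.
# The cdf of Z should match 1 - e^{-z}, i.e. Z ~ Exp(1).
random.seed(4)
lam, reps = 2.0, 200_000
zs = []
for _ in range(reps):
    x = random.expovariate(lam)  # X ~ Exp(lam)  (expovariate takes the rate)
    y = random.expovariate(x)    # Y | X = x ~ Exp(x)
    zs.append(x * y)

for z in (0.5, 1.0, 2.0):
    empirical = sum(1 for v in zs if v <= z) / reps
    print(z, round(empirical, 3), round(1 - math.exp(-z), 3))  # columns agree
```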
Notation : $X,Y|\{Z=k\}\stackrel{iid}{\sim}\cdots$ means that given $Z=k$, $X$ and $Y$ are conditionally independent, and they follow a certain distribution.
(The conditional joint cdf/pmf/pdf equals the product of the conditional cdf's/pmf's/pdf's.)
3.2. Conditional expectation
We have seen that conditional pmf's/pdf's are legitimate pmf's/pdf's. Correspondingly, a conditional distribution is nothing but a probability distribution. It is simply a (potentially) different distribution, since it takes more information into consideration.
As a result, everything previously defined for unconditional distributions can also be defined for conditional distributions.
In particular, it is natural to define the conditional expectation.
Definition . The conditional expectation of $g(X)$ given $Y=y$ is defined as
\mathbb{E}(g(X)|Y=y) = \begin{cases}
\sum_{i=1}^\infty g(x_i) P(X=x_i|Y=y) & \text{ if } X|Y=y \text{ is discrete} \\
\int_{-\infty}^\infty g(x)f_{X|Y}(x|y) dx & \text{ if } X|Y=y \text{ is continuous}
\end{cases}
For fixed $y$, the conditional expectation is nothing but the expectation taken under the conditional distribution.
3.2.1. What is $\mathbb{E}(X|Y)$?
Different ways to understand conditional expectation
Fix a value $y$: $\mathbb{E}(g(X)|Y=y)$ is a number.
As $y$ changes, $\mathbb{E}(g(X)|Y=y)$ becomes a function of $y$ (each $y$ gives a value): $h(y) := \mathbb{E}(g(X)|Y=y)$.
Since $Y$ is itself random, we can define $\mathbb{E}(g(X)|Y) = h(Y)$. This is a random variable $\Omega\to\mathbb{R}$:
\mathbb{E}(g(X)|Y)(\omega) = h(Y(\omega)) = \mathbb{E}(g(X)|Y=Y(\omega))
i.e. when $Y=y$, this random variable takes the value $\mathbb{E}(g(X)|Y=y)$.
3.2.2. Properties of conditional expectation
Linearity (inherited from expectation)
\mathbb{E}(aX+b | Y = y) = a\mathbb{E}(X|Y=y) +b
\mathbb{E}(X+Z|Y=y) = \mathbb{E}(X|Y=y)+\mathbb{E}(Z|Y=y)
\mathbb{E}(g(X,Y)|Y=y) = \mathbb{E}(g(X,y)|Y=y) \cancel{=} \mathbb{E}(g(X,y)) \text{ when } X \text{ and } Y \text{ are not independent}
Proof (Discrete):
\mathbb{E}(g(X,Y)|Y=y) =\sum_{x_i}\sum_{y_j}g(x_i,y_j)\cdot P(X=x_i,Y=y_j|Y=y)
P(X=x_i,Y=y_j|Y=y) =
\begin{cases}
0 & \text{ if } y_j\neq y \\
P(X=x_i,Y=y_j)/P(Y=y)=P(X=x_i|Y=y) & \text{ if } y_j=y
\end{cases}
\begin{aligned}
\Rightarrow \mathbb{E}(g(X,Y)|Y=y)
&= \sum_{x_i}g(x_i,y)\cdot P(X=x_i|Y=y) \\
&= \mathbb{E}(g(X,y)|Y=y) \quad\quad (g(X,y)\text{ regarded as a function of } x)
\end{aligned}
In particular,
\mathbb{E}(g(X)\cdot h(Y) | Y=y) = h(y) \mathbb{E}(g(X)|Y=y)
\mathbb{E}(g(X)\cdot h(Y)|Y) = h(Y)\mathbb{E}(g(X)|Y)
If $X\perp Y$, then $\mathbb{E}(g(X)|Y=y) = \mathbb{E}(g(X))$.
Fact : If $X\perp Y$, then the conditional distribution of $X$ given $Y=y$ is the same as the unconditional distribution of $X$.
Proof (Discrete):
\begin{aligned}
& \text{if } X\perp Y, \\
& P(X=x_i|Y=y_j) \\
&\quad= P(X=x_i, Y=y_j) / P(Y=y_j) \\
&\quad= P(X=x_i)P(Y=y_j)/P(Y=y_j) \\
&\quad= P(X=x_i)
\end{aligned}
Law of iterated expectation (or double expectation): the expectation of a conditional expectation is the unconditional expectation
\mathbb{E}(\mathbb{E}(X|Y))=\mathbb{E}(X)
$\mathbb{E}(X|Y)$ is a r.v., a function of $Y$.
Proof (Discrete):
When $Y=y_j$, the r.v. $\mathbb{E}(X|Y)=\mathbb{E}(X|Y=y_j) = \sum_{x_i}x_iP(X=x_i|Y=y_j)$. This happens with probability $P(Y=y_j)$.
\begin{aligned}
\Rightarrow \mathbb{E}(\mathbb{E}(X|Y))
&= \sum_{y_j}\Big(\sum_{x_i}x_iP(X=x_i|Y=y_j)\Big)P(Y=y_j) \\
&= \sum_{x_i}\sum_{y_j} x_iP(X=x_i|Y=y_j)P(Y=y_j) \\
&= \sum_{x_i}x_i\sum_{y_j}P(X=x_i|Y=y_j)P(Y=y_j) \quad\quad \text{(law of total probability)} \\
&= \sum_{x_i}x_i P(X=x_i) = \mathbb{E}(X)
\end{aligned}
Alternatively,
\begin{aligned}
&\sum_{x_i}\sum_{y_j} x_i P(X=x_i|Y=y_j)P(Y=y_j) \\
&\quad=\sum_{x_i}\sum_{y_j} x_i P(X=x_i,Y=y_j) \quad\quad (g(X,Y)=X \text{ at } (x_i, y_j))\\
&\quad= \mathbb{E}(X)
\end{aligned}
Example :
$Y$: # of claims received by an insurance company
$X$: some random parameter
Y|X\sim Poi(X), \quad X\sim Exp(\lambda)
a) $\mathbb{E}(Y)$?
b) $P(Y=n)$?
a)
Y|X\sim Poi(X) \Rightarrow \mathbb{E}(Y|X=x) = x \Rightarrow \mathbb{E}(Y|X)=X
\begin{aligned}
\therefore\ \mathbb{E}(Y) &= \mathbb{E}(\mathbb{E}(Y|X)) \\
&= \mathbb{E}(X) = \frac{1}{\lambda}
\end{aligned}
b)
\begin{aligned}
P(Y=n) &= \int_0^\infty P(Y=n|X=x)f_X(x)dx \\
&= \int_0^\infty \frac{e^{-x}x^n}{n!}\cdot \lambda e^{-\lambda x} dx \\
&= \frac{\lambda}{n!}\int_0^\infty x^n e^{-(\lambda+1)x}dx \\
&= \frac{\lambda}{(\lambda+1)^{n+1}n!}\int_0^\infty((\lambda+1)x)^n e^{-(\lambda+1)x}d((\lambda+1)x) \\
&= \frac{\lambda}{(\lambda+1)^{n+1}n!}\Gamma(n+1) \quad\quad (\Gamma(n+1) = n!\text{; gamma function, or integration by parts}) \\
&= \frac{\lambda}{(\lambda+1)^{n+1}} = \left(\frac{1}{\lambda+1}\right)^n\cdot\frac{\lambda}{\lambda+1}
\end{aligned}
\Rightarrow Y+1\sim Geo(\lambda/(\lambda+1))
We verify that $\mathbb{E}(Y)=\frac{\lambda +1}{\lambda}-1=\frac{1}{\lambda}$.
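Both answers can be cross-checked by simulating the hierarchy. The sketch below uses an assumed $\lambda=2$ and Knuth's Poisson sampler (an implementation choice, not from the notes): the sample mean of $Y$ should be near $1/\lambda$, and the empirical pmf near $\lambda/(\lambda+1)^{n+1}$.

```python
import math
import random

# Sketch: X ~ Exp(lam), Y | X ~ Poi(X).
# Check E(Y) = 1/lam and P(Y = n) = lam / (lam + 1)^(n + 1).
random.seed(5)
lam, reps = 2.0, 200_000

def poisson(mu):
    # Knuth's method
    L, k, prod = math.exp(-mu), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

ys = [poisson(random.expovariate(lam)) for _ in range(reps)]
print(sum(ys) / reps, 1 / lam)  # both ~0.5
for n in range(4):
    print(n, round(ys.count(n) / reps, 3), round(lam / (lam + 1) ** (n + 1), 3))
```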
3.3. Decomposition of variance (EVVE's law)
Definition : The conditional variance of $Y$ given $X=x$ is defined as
Var(Y|X=x)=\mathbb{E}((Y-\mathbb{E}(Y|X=x))^2|X=x)
Var(Y|X)(\omega) = Var(Y|X=X(\omega)) \quad\quad (Var(Y|X)\text{: a r.v., a function of }X)
The conditional variance is simply the variance taken under the conditional distribution
\Rightarrow Var(Y|X=x)=\mathbb{E}(Y^2|X=x)-(\mathbb{E}(Y|X=x))^2
Thus
Var(Y)=\mathbb{E}(Var(Y|X))+Var(\mathbb{E}(Y|X))
$\mathbb{E}(Var(Y|X))$: "intra-group variance"; $Var(\mathbb{E}(Y|X))$: "inter-group variance"
Proof :
\begin{aligned}
RHS &= \mathbb{E}(\mathbb{E}(Y^2|X)-(\mathbb{E}(Y|X))^2) + \mathbb{E}((\mathbb{E}(Y|X))^2) - (\mathbb{E}(\mathbb{E}(Y|X)))^2 \\
&=\mathbb{E}(\mathbb{E}(Y^2|X)) - \sout{\mathbb{E}((\mathbb{E}(Y|X))^2)} + \sout{\mathbb{E}((\mathbb{E}(Y|X))^2)} - (\mathbb{E}(\mathbb{E}(Y|X)))^2 \\
&=\mathbb{E}(Y^2) - (\mathbb{E}(Y))^2 \\
&=Var(Y)
\end{aligned}
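The intra-group/inter-group reading of EVVE's law suggests an easy numerical check. The sketch below uses an assumed two-group model ($X$ picks a group with probability 1/2 each, $Y$ is uniform on a group-dependent interval) and treats the two groups as equally weighted, which matches $P(X=0)=P(X=1)=1/2$ up to sampling error.

```python
import random

# Sketch: X in {0, 1} picks a group; Y | X = g is Uniform on groups[g].
# Check Var(Y) ~ E(Var(Y|X)) + Var(E(Y|X)).
random.seed(6)
groups = {0: (0.0, 10.0), 1: (20.0, 30.0)}
reps = 200_000
data = []
for _ in range(reps):
    g = random.randrange(2)
    lo, hi = groups[g]
    data.append((g, random.uniform(lo, hi)))

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

total = var([y for _, y in data])
within = mean([var([y for g, y in data if g == k]) for k in groups])   # E(Var(Y|X))
between = var([mean([y for g, y in data if g == k]) for k in groups])  # Var(E(Y|X))
print(total, within + between)  # both ~108.3 (= 100 + 100/12)
```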
4. Stochastic Processes
sequence / family of random variables
a random function (hard to formulate)
Definition : A stochastic process { X t } t ∈ T \{X_t\}_{t\in T} { X t } t ∈ T is a collection of random variables, defined on a common probability space.
T T T : index set. In most cases, T T T corresponds to time, and is either discrete { 0 , 1 , 2 , ⋯   } \{0,1,2,\cdots\} { 0 , 1 , 2 , ⋯ } or continuous [ 0 , ∞ ) [0,\infty) [ 0 , ∞ )
In the discrete case, we write \{X_n\}_{n=0,1,2,\ldots}
The state space S of a stochastic process is the set of all possible values of X_t, t\in T
S S S can also be either discrete or continuous. In this course, we typically deal with discrete state space. Then we relabel the states so that S = { 0 , 1 , 2 , ⋯   } S=\{0,1,2,\cdots\} S = { 0 , 1 , 2 , ⋯ } (countable state space) or S = { 0 , 1 , 2 , ⋯   , M } S=\{0,1,2,\cdots,M\} S = { 0 , 1 , 2 , ⋯ , M } (finite state space)
Remark : As in the case of the joint distribution, we need the r.v.'s in a stochastic process to be defined on a common probability space, because we want to discuss their joint behaviours, i.e., how things change over time.
Thus, we can identify each point \omega in the sample space \Omega with a function defined on T and taking values in S. Each such function is called a path of the stochastic process.
Example Let X_0, X_1, \cdots be independent and identically distributed r.v.'s following some distribution. Then \{X_n\}_{n=0,1,2,...} is a stochastic process.
Example Let X_1, X_2, \ldots be independent and identically distributed r.v.'s with P(X_1=1)=p and P(X_1=-1)=1-p. Define S_0=0,\ S_n=\sum_{i=1}^n X_i,\ n\geq 1, e.g.
S 0 = 0 S_0=0 S 0 = 0
S 1 = X 1 S_1=X_1 S 1 = X 1
S 2 = X 1 + X 2 S_2=X_1+X_2 S 2 = X 1 + X 2
⋯ ⋯ \cdots\cdots ⋯ ⋯
Then { S n } n = 0 , 1 , . . . \{S_n\}_{n=0,1,...} { S n } n = 0 , 1 , . . . is a stochastic process, with state space S = Z S=\Z S = Z (integer)
4.1. Markov Chain
4.1.1. Simple Random Walk
{ S n } n = 0 , 1 , . . . \{S_n\}_{n=0,1,...} { S n } n = 0 , 1 , . . . is called a "simple random walk" . (S n = S n − 1 + X n S_n=S_{n-1}+X_n S n = S n − 1 + X n )
S_n=\begin{cases}
S_{n-1} + 1 & \text{w.p. } p \\
S_{n-1} - 1 & \text{w.p. } 1-p
\end{cases}
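The recursion above is easy to simulate. A minimal Python sketch (the function name and the fixed seed are my own choices, not from the notes):

```python
import random

def simple_random_walk(n_steps, p=0.5, seed=0):
    """Simulate S_0 = 0, S_n = S_{n-1} + X_n with P(X=1)=p, P(X=-1)=1-p."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    path = [0]
    for _ in range(n_steps):
        step = 1 if rng.random() < p else -1
        path.append(path[-1] + step)
    return path

path = simple_random_walk(10, p=0.5)
```

Each call returns one sample path of the process, i.e. one realization of (S_0, ..., S_n).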
Remark : Why do we need the concept of a "stochastic process"? Why not just look at the joint distribution of (X_0, X_1,...,X_n)?
Answer : The joint distribution of a large number of r.v's is very complicated, because it does not take advantage of the special structure of T T T (time).
For example, consider the simple random walk. The full distribution of (S_0, S_1, ..., S_n) is complicated for large n. However, the structure is actually simple if we focus on adjacent times:
S n + 1 = S n + X n + 1 S_{n+1}=S_n+X_{n+1} S n + 1 = S n + X n + 1
S_n: \text{ last value;} \quad X_{n+1}: \text{related to } Ber(p)\text{. They are independent.}
By introducing time into the framework, we can greatly simplify many things.
More precisely, for the simple random walk \{S_n\}_{n=0,1,...}: if we know S_n, the distribution of S_{n+1} will not depend on the history (S_0, ..., S_{n-1}). This is a very useful property.
In general, for a stochastic process \{X_n\}_{n=0,1,...}, at time n we already know X_0, X_1,..., X_n; our best estimate of the distribution of X_{n+1} is the conditional distribution:
X_{n+1}|X_n,\ldots,X_0
given by:
P(X_{n+1}=x_{n+1}|X_n=x_n,..., X_0=x_0)
As time passes, the expression becomes more and more complicated → \rightarrow → impossible to handle.
However, if we know that this conditional distribution is actually the same as the conditional distribution only given X n X_n X n , then the structure will remain simple for any time. This motivates the notion of Markov chain .
4.1.2. Markov Chain
4.1.2.1. Discrete-time Markov Chain
Definition and Examples
Definition : A discrete-time Stochastic process { X n } n = 0 , 1 , . . . \{X_n\}_{n=0,1,...} { X n } n = 0 , 1 , . . . is called a discrete-time Markov Chain (DTMC) , if its state space S S S is discrete, and it has the Markov property:
\begin{aligned}
&\quad P(X_{n+1}=x_{n+1}|X_n=x_n,...,X_0=x_0) \\
&= P(X_{n+1}=x_{n+1}|X_n=x_n)
\end{aligned}
for all n , x 0 , . . . , x n , x n + 1 ∈ S n, x_0,...,x_n,x_{n+1}\in S n , x 0 , . . . , x n , x n + 1 ∈ S
If the distribution of X_{n+1}|\{X_n=i\} does not change over time, i.e. P(X_{n+1}=j|X_n=i)=P(X_1=j|X_0=i), then we call this Markov chain time-homogeneous (the default setting for this course).
\begin{aligned}
&\quad P(X_{n+1}=x_{n+1}|X_n=x_n,...,X_0=x_0 ) \quad\quad &X_{n+1}=x_{n+1}\text{: future; } X_{n}=x_{n}\text{: present (state)} \\
&= P(X_{n+1}=x_{n+1}|X_n=x_n) &X_{n-1}=x_{n-1},...,X_{0}=x_{0}\text{: past (history)}
\end{aligned}
Intuition : Given the present state, the past and the future are independent. In other words, the future depends on the previous results only through the current state.
Example: simple random walk
The simple random walk { S n } n = 0 , 1 , . . . \{S_n\}_{n=0,1,...} { S n } n = 0 , 1 , . . . is a Markov chain
Proof :
Recall that S n + 1 = S n + X n + 1 S_{n+1}=S_n+X_{n+1} S n + 1 = S n + X n + 1
If s_{n+1}\cancel{=}s_n\pm 1:
\begin{aligned}
&\quad P(S_{n+1}=s_{n+1}|S_n=s_n,...,S_0=s_0) \\
& = 0 \\
& = P(S_{n+1}=s_{n+1}|S_n=s_n)
\end{aligned}
If s_{n+1}=s_n+1:
\begin{aligned}
&\quad P(S_{n+1}=s_n+1|S_n=s_n,...,S_0=s_0) \\
&=P(X_{n+1}=1|S_n=s_n,...,S_0=s_0) \\
&=P(X_{n+1}=1) \quad\quad\quad X_{n+1} \perp (X_1,...,X_n) \text{ hence also } (S_0, ..., S_n)
\end{aligned}
Similarly,
\begin{aligned}
&\quad P(S_{n+1}=s_n+1|S_n=s_n) \\
&=P(X_{n+1}=1|S_n=s_n) \\
&=P(X_{n+1}=1) \\
&\Rightarrow P(S_{n+1}=s_n+1|S_n=s_n,...,S_0=s_0) = P(S_{n+1}=s_n+1|S_n=s_n)
\end{aligned}
Similarly,
\begin{aligned}
&\quad P(S_{n+1}=s_n-1|S_n=s_n,...,S_0=s_0) \\
&=P(S_{n+1}=s_n-1|S_n=s_n) \\
&=P(X_{n+1}=-1) \\
&\Rightarrow \{S_n\}_{n=0,1,...} \text{ is a DTMC} \quad\quad\blacksquare
\end{aligned}
4.1.3. One-step transition probability matrix
For a time-homogeneous DTMC, define
\begin{aligned}
P_{ij} &= P(X_1=j|X_0=i) \\
&= P(X_{n+1}=j|X_n=i) \quad\quad n=0,1,...
\end{aligned}
P i j P_{ij} P i j : one step transition probability
The collection of P_{ij}, i,j\in S governs all the one-step transitions of the DTMC. Since it has two indices i and j, it naturally forms a matrix P=\{P_{ij}\}_{i,j\in S}, called the (one-step) transition (probability) matrix, or transition matrix.
Properties of a transition matrix P=\{P_{ij}\}_{i,j\in S}:
\begin{aligned}
&P_{ij}\geq 0\quad \forall i,j\in S \\
&\sum_{j\in S}P_{ij}=1 \quad \forall i\in S \quad\rightarrow\text{ the row sums of } P \text{ are all } 1
\end{aligned}
Reason :
\begin{aligned}
\sum_{j\in S}P_{ij}
&=\sum_{j\in S} P(X_1=j|X_0=i) \\
&= P(X_1\in S|X_0=i) \\
&= 1
\end{aligned}
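The two properties above give a quick sanity check for any candidate transition matrix. A minimal Python sketch (the helper name is my own):

```python
def is_transition_matrix(P, tol=1e-9):
    """Check P_ij >= 0 for all i, j and that every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )
```

For example, `is_transition_matrix([[0.7, 0.3], [0.4, 0.6]])` holds, while a matrix whose rows do not sum to 1 fails the check.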
Example 4.1.3.1. simple random walk
There will be 3 cases:
\begin{aligned}
& P_{i,i+1} = P(S_1=i+1|S_0=i) = P(X_1=1)= p \\
& P_{i,i-1} = P(S_1=i-1|S_0=i) = P(X_1=-1) = 1 - p =:q \\
& P_{i,j}=0 \quad\quad\quad\quad\text{ for } j\cancel{=}i\pm 1
\end{aligned}
\Rightarrow \text{(infinite dimension)}\quad P=\begin{Bmatrix}
... & ... & ... & ... & ... & ... & ...\\
... & 0 & p & 0 & ... & ... & ...\\
... & q & 0 & p & ... & ... & ...\\
... & ... & q & 0 & p & ... & ... \\
... & ... & ... & q & 0 & p & ... \\
... & ... & ... & ... & ... & ... & ...
\end{Bmatrix}
Example 4.1.3.2. Ehrenfest's urn
Two urns A, B, with M balls in total. Each time, pick one ball at random (uniformly) and move it to the opposite urn.
X_n: \# \text{ of balls in } A \text{ after step } n
S=\{0,1,...,M\}
\begin{aligned}
P_{ij}
& =P(X_1=j|X_0=i) \quad\quad\text{($i$ balls in $A$, $M-i$ balls in $B$)}\\
& =\begin{cases}
\begin{aligned}
& i/M \quad\quad\quad\quad\quad\quad&j=i-1\\
& (M-i)/M &j=i+1 \\
& 0 &j\cancel{=}i\pm 1
\end{aligned}
\end{cases}
\end{aligned}
P=\begin{Bmatrix}
0 & 1 \\
1/M & 0 & (M-1)/M \\
&1/M & 0 & (M-1)/M \\
&&2/M & 0 & (M-2)/M \\
...&...&...&...&...&...&...\\
&&&&(M-1)/M & 0 & 1/M \\
&&&&&1&0
\end{Bmatrix}
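The banded structure of the Ehrenfest matrix makes it easy to build programmatically from P_{i,i-1}=i/M and P_{i,i+1}=(M-i)/M. A Python sketch (the function name is my own):

```python
def ehrenfest_matrix(M):
    """Transition matrix of the Ehrenfest urn on S = {0, 1, ..., M}."""
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        if i > 0:
            P[i][i - 1] = i / M          # move a ball out of urn A
        if i < M:
            P[i][i + 1] = (M - i) / M    # move a ball into urn A
    return P
```

For M = 4, the first row is (0, 1, 0, 0, 0) and every row sums to 1, matching the display above.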
Example 4.1.3.3: Gambler's ruin
A gambler each time wins 1 with probability p and loses 1 with probability 1-p=q. Initial wealth S_0=a; wealth at time n: S_n. The gambler leaves if S_n=0 (loses all money) or S_n=M>a (wins a certain amount of money and is satisfied).
This is a variant of the simple random walk, where we have absorbing barriers (P_{ii}=1) at 0 and M.
S=\{0,...,M\}
P_{ij}=\begin{cases}
\begin{aligned}
& p \quad\quad & j=i+1,\ i=1,...,M-1 \\
& q & j=i-1,\ i=1,...,M-1 \\
& 1 & i=j=0 \text{ or } i=j=M \\
& 0 & \text{otherwise}
\end{aligned}
\end{cases}
P=\begin{Bmatrix}
1 & 0 &...& \\
q & 0 & p & ... \\
... & q & 0 & p & ... \\
&... & q & 0 & p & ... \\
...&...&...&...&...&...&...\\
& & & ... & q & 0 & p \\
& & & & ... & 0 & 1
\end{Bmatrix}
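The gambler's ruin matrix follows the same pattern as the simple random walk, except for the two absorbing rows. A Python sketch (the function name is my own):

```python
def gamblers_ruin_matrix(M, p):
    """Transition matrix on S = {0, ..., M} with absorbing barriers at 0 and M."""
    q = 1 - p
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    P[0][0] = 1.0      # ruined: stays at 0 forever
    P[M][M] = 1.0      # satisfied: stays at M forever
    for i in range(1, M):
        P[i][i + 1] = p    # win 1
        P[i][i - 1] = q    # lose 1
    return P
```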
Example 4.1.3.4: Bonus-Malus system
Insurance company has 4 premium levels: 1, 2, 3, 4
Let X n ∈ { 1 , 2 , 3 , 4 } X_n\in\{1,2,3,4\} X n ∈ { 1 , 2 , 3 , 4 } be the premium level for a customer at year n n n
Y_n \stackrel{iid}{\sim}Poi(\lambda): \#\text{ of claims in year } n
If Y_n=0 (no claims):
X_{n+1}=\max(X_n-1,\ 1)
If Y_n>0:
X_{n+1}=\min(X_n+Y_n,\ 4)
Denote a k = P ( Y n = k ) , k = 0 , 1 , . . . a_k=P(Y_n=k), k=0,1,... a k = P ( Y n = k ) , k = 0 , 1 , . . .
P=\begin{Bmatrix}
a_0 & a_1 & a_2 & (1-a_0-a_1-a_2) \\
a_0 & 0 & a_1 & (1-a_0-a_1) \\
0 & a_0 & 0 & (1-a_0) \\
0 & 0 & a_0 & (1-a_0)
\end{Bmatrix}
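With Y_n ~ Poi(λ) as above, the entries a_k and the matrix P can be computed directly. A Python sketch (the helper names are my own):

```python
import math

def poisson_pmf(k, lam):
    """a_k = P(Y = k) for Y ~ Poi(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def bonus_malus_matrix(lam):
    """Transition matrix over premium levels 1..4, as derived in the notes."""
    a0, a1, a2 = (poisson_pmf(k, lam) for k in range(3))
    return [
        [a0,  a1,  a2,  1 - a0 - a1 - a2],
        [a0,  0.0, a1,  1 - a0 - a1],
        [0.0, a0,  0.0, 1 - a0],
        [0.0, 0.0, a0,  1 - a0],
    ]
```

Each row sums to 1 by construction, since the last column absorbs the remaining Poisson tail probability.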
4.2. Chapman-Kolmogorov equations
Q : Given the (one-step) transition matrix P=\{P_{ij}\}_{i,j\in S}, how can we determine the n-step transition probability
\begin{aligned}
P_{ij}^{(n)} &:= P(X_n=j|X_0=i) \\
&=P(X_{n+m}=j|X_m=i), \quad m=0,1,...
\end{aligned}
As a special case, let us start with P_{ij}^{(2)} and their collection P^{(2)}=\{P_{ij}^{(2)}\}_{i,j\in S} (also a square matrix, of the same dimension as P).
Condition on what happens at time 1 1 1 :
\begin{aligned}
P_{ij}^{(2)} &= P(X_2=j|X_0=i) \\
&= \sum_{k\in S} P(X_2=j|X_0=i, X_1=k) \cdot P(X_1=k | X_0=i) \quad \text{conditional law of total probability}
\end{aligned}
4.2.1. Conditional Law of total probability
\begin{aligned}
&\quad P(X_2=j|X_0=i) \\
&=\sum_{k\in S} P(X_2=j, X_1=k | X_0=i) \\
&=\sum_{k\in S} \frac{P(X_2=j, X_1=k, X_0=i)}{P(X_0=i)} \\
&=\sum_{k\in S} \frac{P(X_2=j, X_1=k, X_0=i)}{P(X_1=k, X_0=i)} \cdot\frac{P(X_1=k, X_0=i)}{P(X_0=i)} \\
&=\sum_{k\in S} P(X_2=j|X_0=i,X_1=k)\cdot P(X_1=k|X_0=i)
\end{aligned}
Continuing with P_{ij}^{(2)}:
\begin{aligned}
P_{ij}^{(2)} &= P(X_2=j|X_0=i) \\
&= \sum_{k\in S} P(X_2=j|X_0=i, X_1=k) \cdot P(X_1=k | X_0=i) \quad \text{conditional law of total probability} \\
&=\sum_{k\in S} P(X_2=j|X_1=k)\cdot P(X_1=k|X_0=i) \\
&=\sum_{k\in S} P(X_1=j|X_0=k)\cdot P(X_1=k|X_0=i) \\
&=\sum_{k\in S} P_{ik}\cdot P_{kj} \\
&= (P \cdot P)_{ij}
\end{aligned}
Thus, P ( 2 ) = P ⋅ P = P 2 P^{(2)} = P\cdot P=P^2 P ( 2 ) = P ⋅ P = P 2
Using the same idea, for n,m=0,1,2,3,...:
\begin{aligned}
P_{ij}^{(n+m)} &=P(X_{n+m}=j|X_0=i) \\
&= \sum_{k\in S}P(X_{n+m}=j|X_0=i, X_m=k) \cdot P(X_m=k|X_0=i) \\
&= \sum_{k\in S}P(X_{n+m}=j|X_m=k) \cdot P(X_m=k|X_0=i) \quad \text{Markov property} \\
&= \sum_{k\in S}P(X_{n}=j|X_0=k) \cdot P(X_m=k|X_0=i) \\
&=\sum_{k\in S} P_{ik}^{(m)}\cdot P_{kj}^{(n)} \\
&=(P^{(m)}\cdot P^{(n)})_{ij} \\
&\Rightarrow P^{(n+m)} = P^{(m)}\cdot P^{(n)} \quad\quad (*)
\end{aligned}
By definition, P ( 1 ) = P P^{(1)}=P P ( 1 ) = P
⇒ \Rightarrow ⇒ P ( 2 ) = P ( 1 ) ⋅ P ( 1 ) = P 2 P^{(2)}=P^{(1)}\cdot P^{(1)}=P^2 P ( 2 ) = P ( 1 ) ⋅ P ( 1 ) = P 2
⇒ \Rightarrow ⇒ P ( 3 ) = P ( 2 ) ⋅ P ( 1 ) = P 3 P^{(3)}=P^{(2)}\cdot P^{(1)}=P^3 P ( 3 ) = P ( 2 ) ⋅ P ( 1 ) = P 3
⋯ ⋯ ⋯ \cdots\cdots\cdots ⋯ ⋯ ⋯
⇒ \Rightarrow ⇒ P ( n ) = P n P^{(n)}=P^n P ( n ) = P n
Note:
n n n from P ( n ) P^{(n)} P ( n ) : n-step transition probability matrix
P^{(n)}=\{P_{ij}^{(n)}\}_{i,j\in S}, \quad P_{ij}^{(n)}=P(X_n=j|X_0=i)
n n n from P n P^n P n : n-th power of the (one-step) transition matrix
P^n=P\cdot\ldots\cdot P, \quad P=\{P_{ij}\}_{i,j\in S}, \quad P_{ij}=P(X_1=j|X_0=i)
(*) is called the Chapman-Kolmogorov equations (C-K equations). Entry-wise:
P_{ij}^{(n+m)}=\sum_{k\in S}P_{ik}^{(m)}P_{kj}^{(n)}
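The identity P^{(n)} = P^n and the C-K equations can be verified numerically. A pure-Python sketch with an invented two-state chain (helper names are my own):

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step(P, n):
    """P^(n) = P^n: repeated one-step transitions, for n >= 1."""
    out = P
    for _ in range(n - 1):
        out = mat_mul(out, P)
    return out

# Hypothetical two-state chain for illustration
P = [[0.7, 0.3],
     [0.4, 0.6]]
```

For this P, the entry (P^2)_{00} equals \sum_k P_{0k} P_{k0} = 0.7·0.7 + 0.3·0.4 = 0.61, and P^{(5)} = P^{(2)}·P^{(3)} as the C-K equations require.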
Intuition :
"Condition on what happens at time m (on X_m) and sum up all the possibilities."
4.2.2. Distribution of X n X_n X n
So far, we have seen the transition probability P_{ij}^{(n)}=P(X_n=j|X_0=i). This is not the probability P(X_n=j). To get that distribution, we need information about which state the Markov chain starts from.
Let α 0 , i = P ( X 0 = i ) \alpha_{0,i}=P(X_0=i) α 0 , i = P ( X 0 = i ) . The row vector α 0 = ( α 0 , 0 , α 0 , 1 , . . . ) \alpha_0=(\alpha_{0,0},\alpha_{0,1},...) α 0 = ( α 0 , 0 , α 0 , 1 , . . . ) is called the initial distribution of the Markov chain. This is the distribution of the initial state X 0 X_0 X 0
Similarly, we define distribution of X n X_n X n : α n = ( α n , 0 , α n , 1 , . . . ) \alpha_n=(\alpha_{n,0},\alpha_{n,1},...) α n = ( α n , 0 , α n , 1 , . . . ) where α n , i = P ( X n = i ) \alpha_{n,i}=P(X_n=i) α n , i = P ( X n = i )
Fact : \alpha_n=\alpha_0\cdot P^n
Proof :
\forall j \in S:
\begin{aligned}
\alpha_{n,j} &=P(X_n=j) \\
&= \sum_{i\in S} P(X_n=j|X_0=i)\cdot P(X_0=i) \\
&= \sum_{i\in S} \alpha_{0,i}\cdot P_{ij}^{(n)} \\
&=(\alpha_0\cdot P^{(n)})_j = (\alpha_0\cdot P^n)_j
\end{aligned}
\Rightarrow \alpha_n =\alpha_0\cdot P^n
α n \alpha_n α n : distribution of X n X_n X n
α 0 \alpha_0 α 0 : initial distribution
P^n: n-th power of the transition matrix
Remark : The distribution of a DTMC is completely determined by two things:
the initial distribution α 0 \alpha_0 α 0 (row vector), and
the transition matrix P P P (square matrix)
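The two ingredients combine as \alpha_n = \alpha_0 \cdot P^n. A minimal Python sketch (the function name and the two-state chain are my own choices for illustration):

```python
def distribution_at(alpha0, P, n):
    """alpha_n = alpha_0 · P^n, via n row-vector-times-matrix products."""
    alpha = list(alpha0)
    for _ in range(n):
        alpha = [sum(alpha[i] * P[i][j] for i in range(len(P)))
                 for j in range(len(P[0]))]
    return alpha

# Hypothetical two-state chain, starting in state 0 with certainty
P = [[0.7, 0.3],
     [0.4, 0.6]]
alpha0 = [1.0, 0.0]
```

With this start, `distribution_at(alpha0, P, 1)` is simply the first row of P, and each \alpha_n remains a probability vector.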
4.3. Stationary distribution (invariant distribution)
Definition : A probability distribution π = ( π 0 , π 1 , . . . ) \pi = (\pi_0,\pi_1,...) π = ( π 0 , π 1 , . . . ) is called a stationary distribution (invariant distribution) of the DTMC { X n } n = 0 , 1 , . . . \{X_n\}_{n=0,1,...} { X n } n = 0 , 1 , . . . with transition matrix P P P , if :
\underline{\pi}=\underline{\pi}\cdot P
\sum_{i\in S}\pi_i = 1\ (\Leftrightarrow \underline{\pi}\cdot\mathbb{1}=1, where \mathbb{1} is a column vector of all 1's)
Why is such a \underline{\pi} called a stationary/invariant distribution?
\sum_{i\in S} \pi_i = 1,\ \pi_i \geq 0,\ i=0,1,... \Rightarrow \text{a distribution}; \quad \underline{\pi} = \underline{\pi}\cdot P \Rightarrow \text{invariant/stationary}
Assume the MC starts from the initial distribution \alpha_0=\underline{\pi}. Then the distribution of X_1 is
\alpha_1 =\alpha_0 \cdot P =\underline{\pi}\cdot P = \underline{\pi} = \alpha_0
The distribution of X 2 X_2 X 2 :
\alpha_2=\alpha_0\cdot P^2 =\underline{\pi}\cdot P\cdot P = \underline{\pi} \cdot P = \underline{\pi} = \alpha_0
\cdots\cdots
\alpha_n=\alpha_0
Thus, if the MC starts from a stationary distribution, then its distribution will not change over time.
Example 4.3.1
An electron has two states: g r o u n d ( 0 ) , e x c i t e d ( 1 ) ground(0), excited(1) g r o u n d ( 0 ) , e x c i t e d ( 1 ) . Let X n ∈ { 0 , 1 } X_n\in\{0,1\} X n ∈ { 0 , 1 } be the state at time n n n .
At each step, changes state with probability:
α \alpha α if it is in state 0 0 0 .
β \beta β if it is in state 1 1 1 .
Then \{X_n\} is a DTMC. Its transition matrix is:
P=\begin{Bmatrix}
1-\alpha & \alpha \\
\beta & 1-\beta
\end{Bmatrix}
Now let us solve for the stationary distribution π ‾ = π ‾ ⋅ P \underline{\pi} =\underline{\pi}\cdot P π = π ⋅ P .
(\pi_0,\pi_1) \begin{pmatrix}
1-\alpha & \alpha \\
\beta & 1-\beta
\end{pmatrix}
=(\pi_0,\pi_1)
\Rightarrow \begin{cases}
\pi_0(1-\alpha) + \pi_1\beta=\pi_0\quad(1) \\
\pi_0\alpha + \pi_1(1-\beta) = \pi_1\quad(2)
\end{cases}
We have two equations and two unknowns. However, note that they are not linearly independent: the sum of the LHS = \pi_0+\pi_1 = the sum of the RHS, hence (2) can be derived from (1). By (1), we have:
\alpha\pi_0=\beta\pi_1\quad\text{or}\quad\frac{\pi_0}{\pi_1}=\frac{\beta}{\alpha}
This is where we need \underline{\pi}\cdot\mathbb{1}=1:
\pi_0+\pi_1=1 \Rightarrow\pi_0=\frac{\beta}{\alpha+\beta},\quad \pi_1 =\frac{\alpha}{\alpha+\beta}
Thus, we conclude that there exists a unique stationary distribution ( β α + β , α α + β ) = π ‾ (\frac{\beta}{\alpha+\beta},\frac{\alpha}{\alpha+\beta})=\underline{\pi} ( α + β β , α + β α ) = π
The above procedure for solving for stationary distribution is typical:
Use \underline{\pi}=\underline{\pi}P to get the relations between the different components of \underline{\pi}
Use \underline{\pi}\cdot\mathbb{1}=1 to normalize (get the exact values)
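The result of this two-step recipe can be checked numerically: the formula derived above for the two-state chain should satisfy both \underline{\pi}=\underline{\pi}P and \underline{\pi}\cdot\mathbb{1}=1. A Python sketch with invented values of \alpha, \beta (the helper name is my own):

```python
def is_stationary(pi, P, tol=1e-9):
    """Check pi = pi·P and that pi sums to 1."""
    piP = [sum(pi[i] * P[i][j] for i in range(len(pi))) for j in range(len(pi))]
    return (all(abs(piP[j] - pi[j]) <= tol for j in range(len(pi)))
            and abs(sum(pi) - 1.0) <= tol)

# Hypothetical parameter values for the two-state electron chain
a, b = 0.2, 0.3                       # alpha, beta
P = [[1 - a, a], [b, 1 - b]]
pi = (b / (a + b), a / (a + b))       # (beta/(alpha+beta), alpha/(alpha+beta))
```

Here `is_stationary(pi, P)` holds, while a guess such as (0.5, 0.5) fails the check for these parameters.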
4.4. Classification of States
4.4.1. Transience and Recurrence
Let T_i be the waiting time for the MC to visit/revisit state i for the first time:
T_i:=\min\{n>0:X_n=i\}\quad\quad (T_i \text{ is a r.v.})
T i = ∞ T_i=\infty T i = ∞ if the MC never (re)visits state i i i .
Definition 4.4.1. Transience and Recurrence
A state i i i is called:
transient, if \mathbb{P}(T_i<\infty|X_0=i) < 1 (with positive probability, it never goes back to i)
recurrent, if P ( T i < ∞ ∣ X 0 = i ) = 1 \mathbb{P}(T_i<\infty|X_0=i) = 1 P ( T i < ∞ ∣ X 0 = i ) = 1 (always goes back to state i i i )
positive recurrent, if E ( T i ∣ X 0 = i ) < ∞ \mathbb{E}(T_i|X_0=i)<\infty E ( T i ∣ X 0 = i ) < ∞
null recurrent, if E ( T i ∣ X 0 = i ) = ∞ \mathbb{E}(T_i|X_0=i)=\infty E ( T i ∣ X 0 = i ) = ∞
(Note: a r.v. being finite with probability 1 \cancel{\Rightarrow} its expectation is finite. Example: P(T=2^n)=2^{-n},\ n=1,2,\ldots; then \mathbb{E}(T)=2\cdot\frac{1}{2}+4\cdot\frac{1}{4}+\cdots=\infty.)
Example 4.4.1
P=\begin{pmatrix}
\frac{1}{2} & \frac{1}{2} & \\
& \frac{1}{2} & \frac{1}{2} \\
& & 1
\end{pmatrix}
Given X 0 = 0 X_0=0 X 0 = 0 ,
P(\underbrace{X_1=0}_{T_0=1}|X_0=0)=P(\underbrace{X_1=1}_{T_0=\infty \text{ since states 1, 2}\atop\text{do not go to 0}}|X_0=0)=\frac{1}{2} \quad\Rightarrow\quad P(T_0<\infty|X_0=0)=\frac{1}{2}<1
Thus, state 0 0 0 is transient
Similarly, state 1 1 1 is transient.
Given X 0 = 2 X_0=2 X 0 = 2 ,
P(X_1=2|X_0=2)=1 \Rightarrow P(T_2<\infty|X_0=2) = 1
As \mathbb{E}(T_2|X_0=2)=1<\infty, state 2 is positive recurrent.
In general, the distribution of T i T_i T i is very hard to determine ⇒ \Rightarrow ⇒ need better criteria for recurrence/transience.
Criteria (1) : Define f i i = P ( T i < ∞ ∣ X 0 = i ) f_{ii}=P(T_i<\infty|X_0=i) f i i = P ( T i < ∞ ∣ X 0 = i ) , and
V_i= \#\text{ of times that the MC (re)visits state } i=\sum_{n=1}^\infty \mathbb{1}_{\{X_n=i\}}
If state i i i is transient
P(V_i=k|X_0=i)=\underbrace{f_{ii}^k}_{\text{goes back to $i$,}\atop\text{$k$ times}}\cdot\underbrace{(1-f_{ii})}_{\text{never visits}\atop\text{$i$ again}}
\Rightarrow V_i+1\sim Geo(1-f_{ii})
In particular, P(V_i<\infty|X_0=i)=1: if state i is transient, it is visited only finitely many times with probability 1. The MC will leave state i forever sooner or later.
On the other hand, if state i i i is recurrent, then f i i = 1 f_{ii}=1 f i i = 1
P(V_i=k|X_0=i)=0,\quad k=0,1,... \Rightarrow P(V_i=\infty|X_0=i)=1
If the MC starts at a recurrent state i, it will visit that state infinitely many times.
Criteria (2) : In terms of E ( V i ∣ X 0 = i ) E(V_i|X_0=i) E ( V i ∣ X 0 = i ) :
\begin{aligned}
&E(V_i|X_0=i)=\frac{1}{1-f_{ii}} - 1 = \frac{f_{ii}}{1-f_{ii}} < \infty &\text{if } f_{ii} < 1\ (i \text{ transient}) \\
&E(V_i|X_0=i)=\infty &\text{if } f_{ii}=1\ (i \text{ recurrent})
\end{aligned}
Criterion (3): Note that
$$\begin{aligned}
E(V_i\mid X_0=i) &= E\Big(\sum_{n=1}^\infty \mathbb{I}_{\{X_n=i\}}\,\Big|\,X_0=i\Big) \\
&= \sum_{n=1}^\infty E(\mathbb{I}_{\{X_n=i\}}\mid X_0=i) \\
&= \sum_{n=1}^\infty P(X_n=i\mid X_0=i) \\
&= \sum_{n=1}^\infty P_{ii}^{(n)}
\end{aligned}$$
$$\begin{aligned}
\Rightarrow\ &\sum_{n=1}^\infty P_{ii}^{(n)} < \infty &&\text{if } i \text{ transient} \\
\Rightarrow\ &\sum_{n=1}^\infty P_{ii}^{(n)} = \infty &&\text{if } i \text{ recurrent}
\end{aligned}$$
To conclude,
$$\begin{array}{lll}
\text{state } i: & \text{recurrent} & \text{transient} \\
& P(T_i<\infty\mid X_0=i) = 1 & P(T_i<\infty\mid X_0=i)<1 \\
& P(V_i=\infty\mid X_0=i)=1 & P(V_i<\infty\mid X_0=i)=1 \\
& E(V_i\mid X_0=i)=\infty & E(V_i\mid X_0=i)<\infty \\
\text{easiest to use: } & \sum_{n=1}^\infty P_{ii}^{(n)}=\infty & \sum_{n=1}^\infty P_{ii}^{(n)}<\infty
\end{array}$$
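The $\sum_n P_{ii}^{(n)}$ criterion is the easiest to probe numerically. A minimal sketch (the chain and the cutoff are my own arbitrary choices): state 0 leaks into an absorbing state, so it is transient and its partial sums converge, while the absorbing state 1 is recurrent and its partial sums diverge.

```python
import numpy as np

# State 0 is transient (each step it escapes to absorbing state 1 w.p. 1/2);
# state 1 is recurrent (absorbing).
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])

def return_prob_sum(P, i, N):
    """Partial sum sum_{n=1}^N of P_ii^(n)."""
    total, Pn = 0.0, np.eye(len(P))
    for _ in range(N):
        Pn = Pn @ P            # Pn is now P^n
        total += Pn[i, i]
    return total

s0 = return_prob_sum(P, 0, 200)   # sum of (1/2)^n: converges to 1
s1 = return_prob_sum(P, 1, 200)   # P_11^(n) = 1 for every n: diverges
```

Here `s0` stabilizes near 1 however large the cutoff, while `s1` grows linearly without bound, matching the transient/recurrent dichotomy.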
4.4.2. Periodicity
Example:
$$P=
\begin{pmatrix}
 & 1 & & \\
\frac{1}{2} & & \frac{1}{2} & \\
 & \frac{1}{2} & & \frac{1}{2}\\
 & & 1 &
\end{pmatrix}$$
Note that if we start from 0, we can only get back to 0 in $2,4,6,\cdots$, i.e., an even number of steps: $P_{00}^{(2n+1)}=0,\ \forall n$.
Definition 4.4.2. Period
The period of state $i$ is defined as
$$d_i=\underbrace{\gcd}_{\text{greatest common divisor}}\Big(\{n:\underbrace{P_{ii}^{(n)}>0}_{\text{$i$ can go back to $i$ in $n$ steps}}\}\Big)$$
In the example above, $d_0=\gcd(\{\text{even numbers}\}) = 2$.
If $d_i=1$, state $i$ is called "aperiodic".
If there is no $n > 0$ such that $P_{ii}^{(n)}>0$, then $d_i=\infty$.
Note that $P_{ii} > 0 \Rightarrow d_i = 1$. The converse is not true:
$$P_{00}^{(2)} >0,\ P_{00}^{(3)} > 0 \Rightarrow d_0 =1, \text{ even if } P_{00}=0$$
In general, $d_i=d \not\Rightarrow P_{ii}^{(d)}>0$.
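The definition translates directly into code: collect the return times $n$ with $P_{ii}^{(n)}>0$ and take their gcd. A sketch (the truncation `max_n` and the numerical tolerance are my own assumptions, adequate for small chains):

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, max_n=50):
    """Period of state i: gcd of all n <= max_n with P_ii^(n) > 0
    (a finite truncation of the set in Definition 4.4.2)."""
    Pn, return_times = np.eye(len(P)), []
    for n in range(1, max_n + 1):
        Pn = Pn @ P
        if Pn[i, i] > 1e-12:
            return_times.append(n)
    return reduce(gcd, return_times) if return_times else None

# The 4-state chain above: every return to state 0 takes an even number of steps.
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
```

For this chain `period(P, 0)` returns 2; a chain with a positive self-loop probability at state 0 would return 1 (aperiodic).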
4.4.3. Equivalent classes and irreducibility
Definition 4.4.3.1. Accessible
Let $\{X_n\}_{n=0,1,\cdots}$ be a DTMC with state space $S$. State $j$ is said to be $\underline{\text{accessible}}$ from state $i$, denoted by $i\rightarrow j$, if $P_{ij}^{(n)}>0$ for some $n\geq 0$.
Intuitively, $i$ can reach state $j$ in finitely many steps.
Definition 4.4.3.2. Communicate
If $i\rightarrow j$ and $j\rightarrow i$, we say $i$ and $j$ communicate, denoted by $i\leftrightarrow j$.
Fact 4.4.3.1
"Communication" is an equivalence relation.
$i\leftrightarrow i$, since $P_{ii}^{(0)}= P(X_0=i\mid X_0=i) = 1$ (reflexivity)
$i\leftrightarrow j$ implies $j\leftrightarrow i$ (symmetry)
$i\leftrightarrow j$ and $j\leftrightarrow k$ imply $i\leftrightarrow k$ (transitivity)
Definition 4.4.3.3. Class
As a result, we can use "$\leftrightarrow$" to divide the state space into different classes, each containing only the states that communicate with each other.
$$\begin{cases}
S=\bigcup_k C_k & \text{($\{C_k\}$ is a partition of $S$)}\\
C_k\cap C_{k'} = \emptyset, & k\neq k'
\end{cases}$$
For states $i$ and $j$ in the same class $C_k$, $i\leftrightarrow j$.
For $i$ and $j$ in different classes, $i\not\leftrightarrow j$ (that is, $i\not\rightarrow j$ or $j\not\rightarrow i$).
Definition 4.4.3.4. Irreducible
A MC is called irreducible if it has only one class. In other words, $i\leftrightarrow j$ for any $i, j\in S$.
Q: How do we find the equivalence classes?
A: Draw a graph and find the loops.
Example 4.4.3.1. Find the classes
$$P=\begin{pmatrix}
\frac{1}{2} & \frac{1}{2} & & \\
\frac{1}{2} & \frac{1}{2} & & \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4}\\
 & & & 1
\end{pmatrix}$$
Draw an arrow from $i$ to $j$ if $P_{ij} > 0$.
$P_{01}>0,\ P_{10}>0 \Rightarrow 0\leftrightarrow 1$
State 2 does not communicate with any other state, since $P_{i2}=0$ for $i\neq 2$.
State 3 does not communicate with any other state, since $P_{3j}=0$ for $j\neq 3$ (state 3 is absorbing).
$\Rightarrow$ 3 classes: $\{0,1\}, \{2\}, \{3\}$
Example 4.4.3.2. Find the classes
$$P=\begin{pmatrix}
\frac{1}{2} & \frac{1}{2} & & \\
\frac{1}{2} & & \frac{1}{2} & \\
\frac{1}{2} & & & \frac{1}{2}\\
 & & \frac{1}{2} & \frac{1}{2}
\end{pmatrix}$$
$P_{01}, P_{12}, P_{20} > 0 \Rightarrow 0,1,2$ are in the same class.
$P_{23}, P_{32} > 0 \Rightarrow 2,3$ are in the same class.
Transitivity $\Rightarrow$ 0, 1, 2, 3 are all in the same class.
$\Rightarrow$ This MC is irreducible.
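The "draw a graph and find the loops" recipe is just reachability on the directed graph with an edge $i\to j$ whenever $P_{ij}>0$. A sketch (function and variable names are my own):

```python
import numpy as np

def communicating_classes(P):
    """Partition the states into classes: i ~ j iff i -> j and j -> i."""
    n = len(P)
    reach = (P > 0) | np.eye(n, dtype=bool)      # n = 0 steps is allowed
    for _ in range(n):                            # crude transitive closure
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    comm = reach & reach.T                        # i <-> j
    classes, seen = [], set()
    for i in range(n):
        if i not in seen:
            cls = sorted(j for j in range(n) if comm[i, j])
            classes.append(cls)
            seen.update(cls)
    return classes

# The chains of Examples 4.4.3.1 and 4.4.3.2:
P1 = np.array([[.5, .5, 0, 0], [.5, .5, 0, 0], [.25, .25, .25, .25], [0, 0, 0, 1]])
P2 = np.array([[.5, .5, 0, 0], [.5, 0, .5, 0], [.5, 0, 0, .5], [0, 0, .5, .5]])
```

`communicating_classes(P1)` recovers the three classes $\{0,1\},\{2\},\{3\}$, and `communicating_classes(P2)` returns a single class, i.e., the second chain is irreducible.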
Fact 4.4.3.2
Transience/recurrence is a class property. That is, if $i\leftrightarrow j$, then $j$ is transient/recurrent if and only if $i$ is transient/recurrent.
Proof :
Suppose $i$ is recurrent; then $\sum_{k=1}^\infty P_{ii}^{(k)}=\infty$.
Since $i \leftrightarrow j$, $\exists m,n$ such that $P_{ij}^{(m)}>0$ and $P_{ji}^{(n)}>0$.
Note that
$$\underbrace{P_{jj}^{(m+n+k)}}_{P(X_{m+n+k}=j\,\mid\,X_0=j)} \geq \underbrace{P_{ji}^{(n)}P_{ii}^{(k)}P_{ij}^{(m)}}_{P(X_{m+n+k}=j,\ X_{n+k}=i,\ X_n=i\,\mid\,X_0=j)}$$
$$\begin{aligned}
\Rightarrow \sum_{l=1}^\infty P_{jj}^{(l)}
&\geq \sum_{l=m+n+1}^\infty P_{jj}^{(l)} = \sum_{k=1}^\infty P_{jj}^{(m+n+k)} \\
&\geq \sum_{k=1}^\infty P_{ji}^{(n)}P_{ii}^{(k)}P_{ij}^{(m)}
= \underbrace{P_{ji}^{(n)}}_{>0}\,\underbrace{P_{ij}^{(m)}}_{>0}\,\underbrace{\sum_{k=1}^\infty P_{ii}^{(k)}}_{\infty} = \infty
\end{aligned}$$
Thus, $j$ is recurrent. Symmetrically, $j$ recurrent $\Rightarrow$ $i$ recurrent.
Thus,
$i$ recurrent $\Leftrightarrow$ $j$ recurrent
$i$ transient $\Leftrightarrow$ $j$ transient
For an irreducible MC, since recurrence and transience are class properties, we also say the Markov chain itself is recurrent/transient.
Proposition 4.4.3.1
If an irreducible MC has a finite state space, then it is recurrent.
Idea of proof:
If the MC were transient, then with probability 1 each state has a last visit time. Since there are finitely many states, there exists a last visit time over all the states. As a result, the MC has nowhere to go after that time, a contradiction.
We can actually prove that the MC must be positive recurrent if the state space is finite and the MC is irreducible.
Theorem 4.4.3.1
Periodicity is a class property: $i\leftrightarrow j\Rightarrow d_i=d_j$.
For an irreducible MC, its period is defined as the period of any state.
4.5. Limiting Distribution
In this part, we are interested in $\lim_{n\rightarrow\infty} P_{ij}^{(n)}$ and $\lim_{n\rightarrow\infty}P(X_n=i)$.
To make things simple, we focus on the irreducible case.
Theorem 4.5.1. Basic Limit Theorem
Let $\{X_n\}_{n=0,1,\ldots}$ be an irreducible, aperiodic, positive recurrent DTMC. Then a unique stationary distribution
$$\underline{\pi}=(\pi_0,\pi_1,\ldots)$$
exists.
Moreover:
$$(*)\quad \underbrace{\lim_{n\rightarrow\infty} P_{ij}^{(n)}}_{\substack{\text{limiting distribution}\\ \text{(does not depend on the initial state $i$)}}}
= \lim_{n\rightarrow\infty}\underbrace{\frac{\sum_{k=1}^n\mathbb{I}_{\{X_k=j\}}}{n}}_{\text{long-run fraction of time spent in $j$}}
= \underbrace{\frac{1}{\mathbb{E}(T_j\mid X_0=j)}}_{\substack{T_j=\min\{n>0:X_n=j\}\\ \text{expected revisit time}}}
= \pi_j, \quad i,j\in S$$
Limiting distribution = long-run fraction of time = $1/$expected revisit time = stationary distribution.
The result $(*)$ is still true if the MC is null recurrent, in which case all the terms are $0$ and $\underline{\pi}$ is no longer a distribution (in other words, a stationary distribution does not exist).
If $\{X_n\}_{n=0,1,\ldots}$ has period $d>1$:
$$\frac{\lim_{n\rightarrow\infty} P_{jj}^{(nd)}}{d}
= \lim_{n\rightarrow\infty}\frac{\sum_{k=1}^n\mathbb{I}_{\{X_k=j\}}}{n}
= \frac{1}{\mathbb{E}(T_j\mid X_0=j)}=\pi_j$$
Back to the aperiodic case. Since the limit $\lim_{n\rightarrow\infty}P_{ij}^{(n)}$ does not depend on $i$, $\pi_j=\lim_{n\rightarrow\infty}P_{ij}^{(n)}$ is also the limiting (marginal) probability of state $j$:
$$\lim_{n\rightarrow\infty}\alpha_{n,j} = \lim_{n\rightarrow\infty}P(X_n=j)=\pi_j$$
regardless of the initial distribution $\alpha_0$.
Detail:
$$\begin{aligned}
\lim_{n\rightarrow\infty}\alpha_{n,j}
& = \lim_{n\rightarrow\infty}(\alpha_0\cdot P^{(n)})_j \\
& = \lim_{n\rightarrow\infty}\sum_{i\in S}\alpha_{0,i}\, P_{ij}^{(n)} \\
& = \sum_{i\in S}\lim_{n\rightarrow\infty}\alpha_{0,i}\, P_{ij}^{(n)} \\
& = \sum_{i\in S}\alpha_{0,i}\lim_{n\rightarrow\infty} P_{ij}^{(n)} \\
& = \Big(\sum_{i\in S}\alpha_{0,i}\Big)\pi_j = \pi_j
\end{aligned}$$
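A quick numerical check of this, with an assumed two-state chain whose stationary distribution works out to $\pi=(2/3,1/3)$: two very different initial distributions are driven to the same limit.

```python
import numpy as np

# pi P = pi gives pi = (2/3, 1/3) for this chain.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

alpha_a = np.array([1.0, 0.0])    # start surely in state 0
alpha_b = np.array([0.1, 0.9])    # start mostly in state 1
for _ in range(200):              # alpha_n = alpha_{n-1} P
    alpha_a = alpha_a @ P
    alpha_b = alpha_b @ P
```

After a few hundred steps both marginal distributions agree with $\pi$ to machine precision; the convergence rate is governed by the second eigenvalue (here $0.7$).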
Why are the conditions in the Basic Limit Theorem necessary?
Example 4.5.1
Consider a MC with
$$P=\begin{pmatrix}
\frac{1}{2} & \frac{1}{2} & & \\
\frac{1}{2} & \frac{1}{2} & & \\
 & & \frac{1}{2} & \frac{1}{2} \\
 & & \frac{1}{2} & \frac{1}{2}
\end{pmatrix}$$
Two classes: $\{0,1\}$ and $\{2,3\}$ $\Rightarrow$ it is not irreducible. All the states are still aperiodic and positive recurrent.
This MC can be decomposed into two MCs:
States $0, 1$, with
$$P_1=\begin{pmatrix}
\frac{1}{2} & \frac{1}{2} \\
\frac{1}{2} & \frac{1}{2}
\end{pmatrix} \quad\quad \text{(irreducible)}$$
States $2, 3$, with
$$P_2=\begin{pmatrix}
\frac{1}{2} & \frac{1}{2} \\
\frac{1}{2} & \frac{1}{2}
\end{pmatrix} \quad\quad \text{(irreducible)}$$
And
$$P=\begin{pmatrix}
P_1 & \\
 & P_2
\end{pmatrix}$$
Note that both $(\frac{1}{2},\frac{1}{2},0,0)$ and $(0,0,\frac{1}{2},\frac{1}{2})$ are stationary distributions. Consequently, any convex combination of these two distributions, of the form
$$a\Big(\tfrac{1}{2},\tfrac{1}{2},0,0\Big) + (1-a)\Big(0,0,\tfrac{1}{2},\tfrac{1}{2}\Big),\quad a\in[0,1]$$
is also a stationary distribution.
Thus, irreducibility is related to the uniqueness of the stationary distribution.
Correspondingly, the limiting transition probability will depend on $i$:
$$\lim_{n\rightarrow\infty}P_{00}^{(n)} = \Big(\lim_{n\rightarrow\infty}P_1^n\Big)_{00}=\frac{1}{2}$$
but $\lim_{n\rightarrow\infty}P_{20}^{(n)}=0$.
Example 4.5.2
Consider a MC with
$$P = \begin{pmatrix}
0 & 1\\
1 & 0
\end{pmatrix}$$
Irreducible and positive recurrent, but not aperiodic: $d=2$.
Note that $P^2=I \Rightarrow P^{2n}=I$ and $P^{2n+1}=P$.
$P_{00}^{(n)}=1$ for $n$ even and $0$ for $n$ odd $\Rightarrow \lim_{n\rightarrow\infty}P_{00}^{(n)}$ does not exist.
Aperiodicity is related to the existence of the limit $\lim_{n\rightarrow\infty}P_{ij}^{(n)}$.
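The failure mode is easy to see numerically: powers of this $P$ oscillate and never settle, yet the Cesàro average (the long-run fraction of time) still converges to the stationary value $1/2$. A small sketch:

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# P^n alternates between I (n even) and P (n odd): no entrywise limit.
P10 = np.linalg.matrix_power(P, 10)
P11 = np.linalg.matrix_power(P, 11)

# The running average of P^n still settles at the stationary value 1/2.
N = 1000
avg = sum(np.linalg.matrix_power(P, n) for n in range(1, N + 1)) / N
```

This illustrates why the long-run-fraction-of-time interpretation of $\pi_j$ survives periodicity even though $\lim_n P_{ij}^{(n)}$ does not.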
Example 4.5.3
$$P_{0,j} = p_j,\quad j=0,1,\cdots,\quad p_0>0; \qquad P_{i,i-1}=1,\quad i\geq 1$$
Given $X_0=0$, $T_0=n+1$ if and only if $X_1=n$. This happens with probability $p_n$.
$$\Rightarrow\mathbb{E}(T_0\mid X_0=0)=\sum_{n=0}^\infty (n+1)p_n =1 + \sum_{n=0}^\infty np_n$$
We can construct $p_n$ such that $\sum_{n=0}^\infty np_n=\infty$. (For example, $p_0=\frac{1}{2}$ and $p_n=\frac{1}{2n(n+1)}$ for $n\geq 1$: then $\sum_n p_n=1$ but $\sum_n np_n=\frac{1}{2}\sum_n\frac{1}{n+1}=\infty$.)
In this case, the chain is null recurrent. It is irreducible and aperiodic ($P_{00}=p_0>0$).
A stationary distribution does not exist. Reason:
$$P=
\begin{pmatrix}
p_0 & p_1& p_2 & \cdots & p_i & \cdots \\
1 & 0 & & & & \\
 & 1 & 0 & & & \\
 & & \ddots & \ddots & &
\end{pmatrix}$$
$$\underline{\pi}\cdot P=\underline{\pi} \Rightarrow$$
$$\begin{aligned}
&p_0\pi_0 + \pi_1 = \pi_0 \\
&p_1\pi_0 + \pi_2 = \pi_1 \\
&\quad\quad\vdots\\
&p_{i-1}\pi_0 + \pi_i = \pi_{i-1} \\
&p_i\pi_0 + \pi_{i+1} = \pi_i
\end{aligned}$$
Add the first $i$ equations:
$$(p_0+\cdots+p_{i-1})\pi_0 + (\pi_1+\pi_2+\cdots+\pi_i) = \pi_0 + \pi_1+\cdots + \pi_{i-1}$$
$$\Rightarrow (p_0+\cdots+p_{i-1})\pi_0 + \pi_i = \pi_0$$
$$\Rightarrow \pi_i=\big(1-(p_0+\cdots+p_{i-1})\big)\pi_0 = \Big(\sum_{k=i}^\infty p_k\Big)\pi_0$$
Try to normalize:
$$\begin{aligned}
1 &= \sum_{i=0}^\infty \pi_i = \sum_{i=0}^\infty\Big(\sum_{k=i}^\infty p_k\Big)\pi_0 \\
&= \sum_{k=0}^\infty \sum_{i=0}^{k} p_k\,\pi_0 \\
&= \Big(\underbrace{\sum_{k=0}^\infty (k+1)p_k }_{=\,\infty}\Big)\pi_0
\end{aligned}$$
$$\Rightarrow \pi_0=0,\quad \pi_i=0\quad\forall i$$
This is not a distribution. Thus, a stationary distribution does not exist.
Positive recurrence is related to the existence of the stationary distribution.
Example 4.5.4. Electron
$$P=\begin{pmatrix}
1-\alpha & \alpha \\
\beta & 1-\beta
\end{pmatrix},
\quad \alpha,\beta\in(0,1)$$
Irreducible, aperiodic, positive recurrent.
In order to find $P^n$, we use the diagonalization technique.
$$P=Q\Lambda Q^{-1} \quad\text{where $\Lambda$ is diagonal:}$$
$$\Lambda = \begin{pmatrix}
1 & 0 \\
0 & 1-\alpha-\beta
\end{pmatrix}
\quad
Q=\begin{pmatrix}
1 & \alpha \\
1 & -\beta
\end{pmatrix}
\quad
Q^{-1}=\frac{1}{\alpha+\beta}\begin{pmatrix}
\beta & \alpha\\
1 & -1
\end{pmatrix}$$
Then
$$\begin{aligned}
P^n
&= (Q\Lambda Q^{-1})(Q\Lambda Q^{-1})\cdots(Q\Lambda Q^{-1}) = Q\Lambda^nQ^{-1} \\
&= \begin{pmatrix}
1 & \alpha \\
1 & -\beta
\end{pmatrix}
\begin{pmatrix}
1 & \\
 & (1-\alpha-\beta)^n
\end{pmatrix}
\frac{1}{\alpha+\beta}
\begin{pmatrix}
\beta & \alpha \\
1 & -1
\end{pmatrix}
\\
&=\frac{1}{\alpha+\beta}
\begin{pmatrix}
\beta+\alpha(1-\alpha-\beta)^n & \alpha-\alpha(1-\alpha-\beta)^n \\
\beta-\beta(1-\alpha-\beta)^n & \alpha+\beta(1-\alpha-\beta)^n
\end{pmatrix}\\
\Rightarrow \lim_{n\rightarrow\infty}P^n&=\frac{1}{\alpha+\beta}\begin{pmatrix}
\beta & \alpha \\
\beta & \alpha
\end{pmatrix}
=\begin{pmatrix}
\frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\
\frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta}
\end{pmatrix}
\end{aligned}$$
Note that $\lim_{n\rightarrow\infty} P^n$ has identical rows. This corresponds to the result that $\lim_{n\rightarrow\infty}P_{ij}^{(n)}$ does not depend on $i$.
We saw that the stationary distribution is $\underline{\pi}=(\frac{\beta}{\alpha+\beta},\frac{\alpha}{\alpha+\beta})$, so we verify that $\pi_j=\lim_{n\rightarrow\infty} P_{ij}^{(n)}$.
Also, given $X_0=0$, $\mathbb{P}(T_0=1\mid X_0=0)=1-\alpha$.
For $k=2,3,\cdots$:
$$\begin{aligned}
\mathbb{P}(T_0=k\mid X_0=0) &= \mathbb{P}(X_k=0,X_{k-1}=1,\cdots, X_1=1\mid X_0=0) = \alpha(1-\beta)^{k-2}\beta\\
\Rightarrow \mathbb{E}(T_0\mid X_0=0)
&=1\cdot(1-\alpha)+\sum_{k=2}^\infty k\,\alpha(1-\beta)^{k-2}\beta \\
&=1-\alpha+\alpha\underbrace{\sum_{k=2}^\infty(1-\beta)^{k-2}\beta\,(k-1)}_{=\,\mathbb{E}(\mathrm{Geo}(\beta))\,=\,1/\beta}+\alpha\underbrace{\sum_{k=2}^\infty(1-\beta)^{k-2}\beta}_{\text{sum of the pmf of Geo($\beta$)}\,=\,1}\\
&=1-\cancel{\alpha}+\frac{\alpha}{\beta}+\cancel{\alpha}
=\frac{\alpha+\beta}{\beta}
\end{aligned}$$
Hence we verify that $\mathbb{E}(T_0\mid X_0=0)=\frac{1}{\pi_0}$.
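These identities can all be checked numerically for a concrete choice of parameters, say $\alpha=0.3$, $\beta=0.6$ (so $\pi=(2/3,1/3)$ and $\mathbb{E}(T_0\mid X_0=0)=1.5$):

```python
import numpy as np

a, b = 0.3, 0.6                     # alpha, beta
P = np.array([[1 - a, a],
              [b, 1 - b]])

pi = np.array([b / (a + b), a / (a + b)])   # claimed stationary distribution

Pn = np.linalg.matrix_power(P, 50)  # (1-a-b)^50 is negligible: both rows ~ pi

mean_return_0 = (a + b) / b         # E(T_0 | X_0 = 0) from the derivation
```

Both rows of `Pn` match `pi`, and the mean revisit time of state 0 equals $1/\pi_0$, as the Basic Limit Theorem predicts.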
4.6. Generating function and branching processes
Definition 4.6.1
Let $\underline{p}=(p_0,p_1,\cdots)$ be a distribution on $\{0,1,2,\cdots\}$, and let $\xi$ be a r.v. following distribution $\underline{p}$; that is, $\mathbb{P}(\xi=i)=p_i$. Then the generating function of $\xi$, or of $\underline{p}$, is defined by
$$\psi(s)=\mathbb{E}(s^\xi)=\sum_{k=0}^\infty p_ks^k\quad\quad \text{for } 0\leq s\leq 1$$
Properties of generating function
$\psi(0)=p_0,\quad\psi(1)=\sum_{k=0}^\infty p_k=1$
The generating function determines the distribution:
$$p_k=\frac{1}{k!}\frac{d^k\psi(s)}{ds^k}\bigg|_{s=0}$$
Reason:
$$\psi(s)=p_0+p_1s+\cdots+p_{k-1}s^{k-1}+p_ks^k+p_{k+1}s^{k+1}+\cdots$$
$$\frac{d^k\psi(s)}{ds^k}=k!\,p_k +(\cdots)s +(\cdots)s^2+\cdots
\quad\Rightarrow\quad
\frac{d^k\psi(s)}{ds^k}\bigg|_{s=0}=k!\,p_k$$
In particular, since all $p_k\geq 0$, we have $\psi'(s)\geq 0$ and $\psi''(s)\geq 0$ on $[0,1]$: $\psi$ is increasing and convex.
Let $\xi_1,\dots,\xi_n$ be independent r.v.s with generating functions $\psi_1,\dots,\psi_n$, and
$$X=\xi_1+\cdots+\xi_n \Rightarrow \psi_X(s)=\psi_1(s)\psi_2(s)\cdots\psi_n(s)$$
Proof :
$$\begin{aligned}
\psi_X(s) &= \mathbb{E}(s^X) \\
&= \mathbb{E}(s^{\xi_1}s^{\xi_2}\cdots s^{\xi_n}) \\
(\text{independence})\quad &= \mathbb{E}(s^{\xi_1})\cdots\mathbb{E}(s^{\xi_n})\\
&= \psi_1(s)\cdots\psi_n(s)
\end{aligned}$$
$$\frac{d^k\psi(s)}{ds^k}\bigg|_{s=1} =
\frac{d^k\mathbb{E}(s^\xi)}{ds^k}\bigg|_{s=1} =
\mathbb{E}\Big(\frac{d^ks^\xi}{ds^k}\Big)\bigg|_{s=1} =
\mathbb{E}\big(\xi(\xi-1)(\xi-2)\cdots(\xi-k+1)s^{\xi-k}\big)\Big|_{s=1} =
\mathbb{E}\big(\xi(\xi-1)\cdots(\xi-k+1)\big)$$
In particular, $\mathbb{E}(\xi) = \psi'(1)$ and $Var(\xi)=\mathbb{E}(\xi^2)-(\mathbb{E}(\xi))^2=\mathbb{E}(\xi^2-\xi)+\mathbb{E}(\xi)-(\mathbb{E}(\xi))^2 = \psi''(1)+\psi'(1)-(\psi'(1))^2$.
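These identities are easy to sanity-check on a small concrete distribution, say $\underline{p}=(0.2,0.5,0.3)$ on $\{0,1,2\}$ (an arbitrary choice of mine):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])       # P(xi = k) for k = 0, 1, 2
k = np.arange(len(p))

def psi(s):   return np.sum(p * s**k)                       # psi(s) = E(s^xi)
def dpsi(s):  return np.sum(p * k * s**(k - 1))             # psi'(s)
def d2psi(s): return np.sum(p * k * (k - 1) * s**(k - 2))   # psi''(s)

mean = dpsi(1.0)                                # E(xi) = psi'(1) = 1.1
var = d2psi(1.0) + dpsi(1.0) - dpsi(1.0) ** 2   # psi''(1)+psi'(1)-psi'(1)^2
```

Here `psi(0.0)` recovers $p_0$, `psi(1.0)` is 1, and `var` agrees with the direct computation $\mathbb{E}(\xi^2)-(\mathbb{E}\xi)^2=1.7-1.21=0.49$.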
(Graph of a g.f.: $\psi$ is increasing and convex on $[0,1]$, rising from $\psi(0)=p_0$ to $\psi(1)=1$.)
4.6.1. Branching Process
Each organism, at the end of its life, produces a random number $Y$ of offspring.
$$\mathbb{P}(Y=k)=P_k, \quad k=0,1,2,\dots, \quad P_k\geq 0,\quad \sum_{k=0}^\infty P_k=1$$
The numbers of offspring of different individuals are independent.
Start from one ancestor ($X_0=1$); $X_n$ = # of individuals (the population of the $n$-th generation).
Then $X_{n+1}=Y_1^{(n)}+Y_2^{(n)}+\cdots+Y_{X_n}^{(n)}$, where $Y_1^{(n)},\dots,Y_{X_n}^{(n)}$ are independent copies of $Y$, and $Y_i^{(n)}$ is the number of offspring of the $i$-th individual in the $n$-th generation.
4.6.1.1. Mean and Variance
Mean $\mathbb{E}(X_n)$ and variance $Var(X_n)$.
Assume $\mathbb{E}(Y)=\mu$ and $Var(Y)=\sigma^2$.
$$\begin{aligned}
\mathbb{E}(X_{n+1})
&= \mathbb{E}(Y_1^{(n)}+\cdots+Y_{X_n}^{(n)}) \\
&= \mathbb{E}\big(\mathbb{E}(Y_1^{(n)}+\cdots+Y_{X_n}^{(n)}\mid X_n)\big) \\
&= \mathbb{E}(X_n\mu) \qquad \text{(Wald's identity, tutorial 3)}\\
&= \mu\,\mathbb{E}(X_n)
\end{aligned}$$
$$\Rightarrow \mathbb{E}(X_n)=\mu\mathbb{E}(X_{n-1})=\mu^2\mathbb{E}(X_{n-2})=\cdots=\mu^n\mathbb{E}(X_0) = \mu^n,\quad n=0,1,\dots$$
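A deterministic check of $\mathbb{E}(X_n)=\mu^n$, using the standard branching-process fact (not derived in this section) that the generating function of $X_n$ is the $n$-fold composition $\psi\circ\cdots\circ\psi$; the offspring distribution $(0.2, 0.4, 0.4)$, giving $\mu=1.2$, is an arbitrary choice:

```python
from numpy.polynomial import Polynomial

# gf of the offspring count Y: psi(s) = 0.2 + 0.4 s + 0.4 s^2, so mu = 1.2.
psi = Polynomial([0.2, 0.4, 0.4])

gf = Polynomial([0.0, 1.0])      # gf of X_0 = 1 is just s
for _ in range(5):
    gf = psi(gf)                 # compose to get the gf of the next generation

mean_X5 = gf.deriv()(1.0)        # E(X_5) = gf'(1), should equal mu^5
```

Calling a numpy `Polynomial` on another `Polynomial` performs composition, so after five steps `gf` is the generating function of $X_5$ and its derivative at 1 is $\mu^5=1.2^5$.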
$$Var(X_{n+1}) = \mathbb{E}\big(Var(X_{n+1}\mid X_n)\big)+Var\big(\mathbb{E}(X_{n+1}\mid X_n)\big)$$
$$\begin{aligned}
\mathbb{E}\big(Var(X_{n+1}\mid X_n)\big)
&=\mathbb{E}\big(Var(Y_1^{(n)}+\cdots+Y_{X_n}^{(n)}\mid X_n)\big)\\
&=\mathbb{E}(X_n\cdot\sigma^2) = \sigma^2\mu^n
\end{aligned}
\qquad
\begin{aligned}
Var\big(\mathbb{E}(X_{n+1}\mid X_n)\big)
&= Var(\mu X_n) \\
&= \mu^2\,Var(X_n)
\end{aligned}$$
$$\begin{aligned}
\Rightarrow\ & Var(X_{n+1}) = \sigma^2\mu^n+\mu^2Var(X_n)\\
& Var(X_1)=\sigma^2 \\
& Var(X_2)=\sigma^2\mu + \mu^2\sigma^2=\sigma^2(\mu^1+\mu^2) \\
& Var(X_3)=\sigma^2\mu^2+\mu^2\sigma^2(\mu^1+\mu^2) = \sigma^2(\mu^2 + \mu^3 + \mu^4)\\
&\quad\quad\vdots\\
&\text{In general (can be proved by induction):}\\
& Var(X_n)=\sigma^2(\mu^{n-1}+\cdots+\mu^{2n-2})
=\begin{cases}
\sigma^2\mu^{n-1}\dfrac{1-\mu^n}{1-\mu} & \mu\neq 1\\[2mm]
\sigma^2 n & \mu=1
\end{cases}
\end{aligned}$$
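The closed form can be checked against the recursion $Var(X_{n+1})=\sigma^2\mu^n+\mu^2\,Var(X_n)$ directly; the moments $\mu=1.2$, $\sigma^2=0.56$ are arbitrary choices:

```python
mu, s2 = 1.2, 0.56     # assumed offspring mean and variance

v, recursed = 0.0, []  # Var(X_0) = 0 since X_0 = 1 is deterministic
for n in range(10):
    v = s2 * mu**n + mu**2 * v    # this is Var(X_{n+1})
    recursed.append(v)

# Closed form Var(X_m) = s2 * mu^(m-1) * (1 - mu^m) / (1 - mu), with m = n + 1:
closed = [s2 * mu**n * (1 - mu**(n + 1)) / (1 - mu) for n in range(10)]
```

The two lists agree term by term, with `recursed[0]` equal to $Var(X_1)=\sigma^2$ as the derivation states.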
4.6.1.2. Extinction Probability
Q: What is the probability that the population size is eventually reduced to 0?
Note that for a branching process, X n = 0 ⇒ X k = 0 X_n=0\Rightarrow X_k=0 X n = 0 ⇒ X k = 0 for all k ≥ n k\geq n k ≥ n . Thus, state 0 0 0 is absorbing. ( P 00 = 1 ) (P_{00}=1) ( P 0 0 = 1 ) . Let N N N be the time that extinction happens.
N=\min\{n:X_n=0\}
Define
U_n=\mathbb{P}(\underbrace{N\leq n}_{\text{extinction happens}\atop\text{before or at }n})=\mathbb{P}(X_n=0)
Then U n U_n U n is increasing in n n n , and
\begin{aligned}
u=\lim_{n\rightarrow\infty}U_n
&= \mathbb{P}(N<\infty) \\
&= \mathbb{P}(\text{the extinction eventually happens}) \\
&= \text{extinction probability}
\end{aligned}
Our goal : find $u$
We have the following relation between U n U_n U n and U n − 1 U_{n-1} U n − 1 :
U_n=\sum_{k=0}^\infty P_k(U_{n-1})^k = \underbrace{\psi}_{\text{gf of }Y}(U_{n-1})
Each subpopulation has the same distribution as the whole population.
The total population dies out within $n$ steps if and only if each subpopulation dies out within $n-1$ steps.
\begin{aligned}
U_n
&= \mathbb{P}(N\leq n) \\
&= \sum_k \mathbb{P}(N\leq n|X_1 = k)\underbrace{\mathbb{P}(X_1=k)}_{=P_k} \\
&=\sum_k \mathbb{P}(\underbrace{N_1\leq n-1}_{\text{\# of steps for subpopulation 1 to die out}},\cdots, N_k\leq n-1|X_1=k)\cdot P_k \\
&= \sum_k P_k\cdot U_{n-1}^k \\
&= \mathbb{E}(U_{n-1}^Y) \\
&= \psi(U_{n-1})
\end{aligned}
Thus, the question is :
\quad With initial value $U_0=0$ (since $X_0=1$) and the relation $U_n=\psi(U_{n-1})$, what is $\lim_{n\rightarrow\infty}U_n=u$?
Recall that we have
ψ ( 0 ) = P 0 ≥ 0 \psi(0)=P_0\geq 0 ψ ( 0 ) = P 0 ≥ 0
ψ ( 1 ) = 1 \psi(1) = 1 ψ ( 1 ) = 1
ψ ( s ) \psi(s) ψ ( s ) is increasing
ψ ( s ) \psi(s) ψ ( s ) is convex
Draw ψ ( s ) \psi(s) ψ ( s ) and function f ( s ) = s f(s)=s f ( s ) = s between 0 and 1, we have two cases:
The extinction probability u u u will be the smallest intersection of ψ ( s ) \psi(s) ψ ( s ) and f ( s ) f(s) f ( s ) . Equivalently, it is the smallest solution of the equation ψ ( s ) = s \psi(s)=s ψ ( s ) = s between 0 and 1.
Reason : See the dynamics on a graph
\begin{aligned}
&\text{Case } 1: &u < 1 \\
&\text{Case } 2: &u=1 & \text{ (extinction happens for sure)}
\end{aligned}
Q: How to tell if we are in case 1 or in case 2?
A: check ψ ′ ( 1 ) = E ( Y ) \psi'(1)=\mathbb{E}(Y) ψ ′ ( 1 ) = E ( Y )
\begin{aligned}
& \psi'(1) > 1 &\rightarrow &\text{ Case }1\\
& \psi'(1) \leq 1 &\rightarrow &\text{ Case }2
\end{aligned}
Thus, we conclude:
E ( Y ) > 1 \mathbb{E}(Y) > 1 E ( Y ) > 1 : an average more than 1 offspring
$\Rightarrow$ extinction happens with probability $u<1$. $u$ is the smallest solution of $\psi(s)=s$ between 0 and 1 (the unique solution in $[0,1)$)
E ( Y ) ≤ 1 \mathbb{E}(Y) \leq 1 E ( Y ) ≤ 1 : an average less than or equal to 1 offspring
⇒ \Rightarrow ⇒ extinction happens for sure (with probability 1)
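The fixed-point iteration $U_n=\psi(U_{n-1})$ with $U_0=0$ can be run directly. A minimal Python sketch, using a made-up offspring distribution $P_0=1/4$, $P_1=1/4$, $P_2=1/2$ (so $\mathbb{E}(Y)=\psi'(1)=5/4>1$):

```python
def extinction_probability(pgf, tol=1e-12, max_iter=10000):
    """Iterate U_n = psi(U_{n-1}) from U_0 = 0; the limit is the smallest
    fixed point of the offspring pgf psi on [0, 1], i.e. the extinction prob."""
    u = 0.0
    for _ in range(max_iter):
        u_next = pgf(u)
        if abs(u_next - u) < tol:
            return u_next
        u = u_next
    return u

# Made-up offspring distribution: P_0 = 1/4, P_1 = 1/4, P_2 = 1/2,
# so psi(s) = 1/4 + s/4 + s^2/2 and E(Y) = psi'(1) = 5/4 > 1  ->  u < 1.
psi = lambda s: 0.25 + 0.25 * s + 0.5 * s ** 2
u = extinction_probability(psi)
```

For this example $\psi(s)=s$ reduces to $2s^2-3s+1=0$ with roots $1/2$ and $1$, and the iteration indeed converges to the smaller root $u=1/2$.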
5. Poisson Processes
5.1. Counting Process
DTMC is a discrete-time process. That is, the index set is $T=\{0,1,2,...\}$ and the chain is $\{X_n\}_{n=0,1,2,3,\cdots}$.
We also want to consider the cases where time can be continuous,
\text{Continuous-time processes: } T=[0,\infty)
\{X_t\}_{t\geq 0} \text{ or } \{X(t)\}_{t\geq 0}
The simplest type of continuous-time process is the counting process, which counts the number of occurrences of a certain event before time $t$.
Definition 5.1.1. Counting Process N ( t ) N(t) N ( t )
Let $0\leq S_1\leq S_2\leq\cdots$ be the occurrence times of some events. Then, the process
\begin{aligned}
N(t) &:= \#\{n:S_n\leq t\} \\
&=\sum_{n=1}^\infty \mathbb{1}_{\{S_n\leq t\}}
\end{aligned}
is called the counting process (of the events { S n } n = 1 , 2 , . . . \{S_n\}_{n=1,2,...} { S n } n = 1 , 2 , . . . )
Equivalently, N ( t ) = n    ⟺    S n ≤ t < S n + 1 N(t) = n \iff S_n\leq t< S_{n+1} N ( t ) = n ⟺ S n ≤ t < S n + 1
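The equivalence $N(t)=n \iff S_n\leq t<S_{n+1}$ is exactly a sorted-array lookup; a small Python sketch (the arrival times are made-up values):

```python
import bisect

def N(t, arrival_times):
    """N(t) = #{n : S_n <= t}; bisect_right makes the path right-continuous
    (an arrival at exactly time t is counted)."""
    return bisect.bisect_right(arrival_times, t)

S = [0.4, 1.1, 1.8, 3.0]  # hypothetical sorted arrival times S_1 <= S_2 <= ...
```

For example `N(1.1, S)` is 2, reflecting right-continuity: the jump at $S_2=1.1$ is already counted at $t=1.1$.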
Example 5.1.1
Calls arrive at a call center.
S n S_n S n : arrival time of the n n n -th call
N ( t ) N(t) N ( t ) : the number of calls received before time t t t
Other examples: cars passing a speed reader, atoms having radioactive decay, ...
Properties of a counting process
N ( t ) ≥ 0 , t ≥ 0 N(t)\geq 0, t\geq 0 N ( t ) ≥ 0 , t ≥ 0
N ( t ) N(t) N ( t ) takes integer values
N ( t ) N(t) N ( t ) is increasing.
N ( t 1 ) ≤ N ( t 2 ) N(t_1)\leq N(t_2) N ( t 1 ) ≤ N ( t 2 ) if t 1 ≤ t 2 t_1\leq t_2 t 1 ≤ t 2
N ( t ) N(t) N ( t ) is right-continuous
N ( t ) = lim s ↓ t N ( s ) N(t) = \lim_{s\downarrow t}N(s) N ( t ) = lim s ↓ t N ( s )
We also assume:
N ( 0 ) = 0 N(0) = 0 N ( 0 ) = 0 (No event happens at time 0)
$N(t)$ only has jumps of size 1.
(No two events happen at exactly the same time)
5.2. Definition of Poisson Process
Interarrival Times
W 1 , W 2 , … W_1, W_2, \ldots W 1 , W 2 , …
W 1 = S 1 W_1=S_1 W 1 = S 1
W n = S n − S n − 1 W_n=S_n-S_{n-1} W n = S n − S n − 1 : interarrival time between n − 1 n-1 n − 1 -th and the n n n -th event
Definition 5.2.1. Renewal Process
A renewal process is a counting process for which the interarrival times $W_1, W_2, ...$ are independent and identically distributed.
All three examples of counting processes above can be reasonably modeled as renewal processes.
Definition 5.2.2. Poisson Process
A Poisson Process $\{N(t)\}_{t\geq0}$ is a renewal process for which the interarrival times are exponentially distributed:
W_n \stackrel{i.i.d.}{\sim} Exp(\lambda)
A Poisson process { N ( t ) } t ≥ 0 \{N(t)\}_{t\geq 0} { N ( t ) } t ≥ 0 can be denoted as
\{N(t)\}\sim Poi(\underbrace{\lambda}_{\text{intensity}}\,t)
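A Poisson process can be simulated directly from this definition by summing i.i.d. $Exp(\lambda)$ interarrival times; the sketch below (intensity and horizon are made-up values) checks that the average count over $[0,1]$ is close to $\lambda$:

```python
import random

def simulate_poisson_arrivals(lam, horizon, rng):
    """Arrival times of a Poisson process of intensity lam on [0, horizon],
    built by summing i.i.d. Exp(lam) interarrival times."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)  # next interarrival W_n ~ Exp(lam)
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(0)
lam = 3.0
counts = [len(simulate_poisson_arrivals(lam, 1.0, rng)) for _ in range(20000)]
mean_count = sum(counts) / len(counts)  # should be close to E(N(1)) = lam
```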
Recall : Properties of the Exponential Distributions
X ∼ E x p ( λ ) X\sim Exp(\lambda) X ∼ E x p ( λ )
Basic properties
pdf: f ( x ) = λ e − λ x ( x > 0 ) f(x)=\lambda e^{-\lambda x}\quad(x>0) f ( x ) = λ e − λ x ( x > 0 )
cdf: F ( x ) = 1 − e − λ x F(x)=1-e^{-\lambda x} F ( x ) = 1 − e − λ x
E ( x ) = 1 λ \mathbb{E}(x)=\frac{1}{\lambda} E ( x ) = λ 1
V a r ( X ) = 1 λ 2 Var(X)=\frac{1}{\lambda^2} V a r ( X ) = λ 2 1
Memoryless property
\mathbb{P}(X>s+t|X>s) = \mathbb{P}(X>t)
Min of exponentials
X 1 , . . . , X n X_1,...,X_n X 1 , . . . , X n independent, X i ∼ E x p ( λ i ) X_i\sim Exp(\lambda_i) X i ∼ E x p ( λ i ) , then
\min(X_1, \cdots, X_n)\sim Exp(\lambda_1+\cdots+\lambda_n)
Proof : it suffices to prove the result for $n=2$. Let $Z=\min(X_1, X_2)$, then
\begin{aligned}
\mathbb{P}(Z>z)
&= \mathbb{P}(X_1>z, X_2>z)\\
&= \mathbb{P}(X_1>z)\cdot \mathbb{P}(X_2>z) \\
&= e^{-\lambda_1z}\cdot e^{-\lambda_2z} \\
&= e^{-(\lambda_1+\lambda_2)z}
\end{aligned}
\Rightarrow\mathbb{P}(Z\leq z) = \underbrace{1-e^{-(\lambda_1+\lambda_2)z}}_{\text{cdf of $Exp(\lambda_1+\lambda_2)$}}\quad z>0
\Rightarrow Z\sim Exp(\lambda_1+\lambda_2)
\mathbb{P}(X_i=\min(X_1,\cdots,X_n))=\frac{\lambda_i}{\lambda_1+\cdots+\lambda_n}
Proof : (again for $n=2$)
\begin{aligned}
&\mathbb{P}(X_1=\min(X_1,X_2)) \\
&= \mathbb{P}(X_1\leq X_2) \\
&= \mathbb{E}(\mathbb{P}(X_1\leq X_2|X_1)) \\
&= \mathbb{E}(e^{-\lambda_2X_1}) \\
&= \int_0^\infty e^{-\lambda_2x}\lambda_1e^{-\lambda_1x} dx \\
&= \lambda_1\int_0^\infty e^{-(\lambda_1+\lambda_2)x}dx \\
&= \frac{\lambda_1}{\lambda_1+\lambda_2}
\end{aligned}
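Both properties of the minimum can be checked by Monte Carlo; a short Python sketch with made-up rates $\lambda_1=1$, $\lambda_2=2$:

```python
import random

rng = random.Random(42)
lam1, lam2 = 1.0, 2.0
n = 100000
wins = 0          # how often X1 is the minimum
total_min = 0.0   # running sum of min(X1, X2)
for _ in range(n):
    x1, x2 = rng.expovariate(lam1), rng.expovariate(lam2)
    wins += x1 <= x2
    total_min += min(x1, x2)

p_hat = wins / n          # theory: lam1 / (lam1 + lam2) = 1/3
mean_min = total_min / n  # theory: E(Exp(lam1 + lam2)) = 1/3
```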
5.3. Properties of Poisson Processes
5.3.1. Continuous-time Markov Property
\begin{aligned}
&\mathbb{P}(N(t_m)=j|N(t_{m-1})=i,N(t_{m-2})=i_{m-2},\cdots, N(t_1)=i_1) \\
=\,&\mathbb{P}(N(t_m)=j|N(t_{m-1})=i)
\end{aligned}
for any m m m , t 1 < ⋯ < t m t_1<\cdots<t_m t 1 < ⋯ < t m , i 1 , i 2 , ⋯   , i m − 2 , i , j ∈ S i_1,i_2,\cdots, i_{m-2},i,j\in S i 1 , i 2 , ⋯ , i m − 2 , i , j ∈ S
Fact 5.3.1.1
The Poisson Process is the only renewal process having the Markov Property
Reason :
Since the exponential distribution is memoryless, the future arrival times will not depend on how long we have waited ⇒ \Rightarrow ⇒ The future of the counting process only depends on its current value.
In fact,
\begin{aligned}
&\mathbb{P}(N(t+s)=j|N(s)=i) \\
\text{time homogeneity}&= \mathbb{P}(N(t)=j|N(0)=i) \quad\text{only differs by the number at which we start counting}\\
&= \mathbb{P}(N(t)=j-i|N(0)=0)
\end{aligned}
5.3.1.1. Independent Increments
The Poisson Process has independent increments
\begin{aligned}
t_1<t_2<t_3<t_4
\Rightarrow \underbrace{N(t_2) - N(t_1)}_{\text{increment}} \perp\!\!\!\perp \underbrace{N(t_4)-N(t_3)}_{\text{increment}}
\end{aligned}
Reasons :
Memoryless property of exponential distribution.
5.3.1.2. Poisson Increments
The Poisson Process has Poisson increments
N(t_2)-N(t_1)\sim Poi(\lambda(t_2-t_1))
Reason :
Let the arrival times between $t_1$ and $t_2$ be $S_1,\cdots, S_N$, where $N=N(t_2)-N(t_1)$. Then $W_1=S_1-t_1$, $W_2=S_2-S_1$, $\cdots$ are i.i.d. r.v's with distribution $Exp(\lambda)$.
\begin{aligned}
N=n\Leftrightarrow\;
&W_1+W_2+\cdots +W_n \leq t_2-t_1 \text{ and}\\
&W_1+W_2+\cdots +W_n+W_{n+1} > t_2-t_1
\end{aligned}
Fact 5.3.1.2
If W 1 , ⋯   , W n W_1,\cdots,W_n W 1 , ⋯ , W n are i.i.d. r.v's following E x p ( λ ) Exp(\lambda) E x p ( λ ) , then W 1 + ⋯ + W n ∼ E r l a n g ( n , λ ) W_1+\cdots+W_n\sim Erlang(n,\lambda) W 1 + ⋯ + W n ∼ E r l a n g ( n , λ ) (a special type of G a m m a Gamma G a m m a )
c.d.f: F(x)=1-\sum_{k=0}^{n-1}\frac{1}{k!}e^{-\lambda x}(\lambda x)^k
Thus,
\begin{aligned}
&\mathbb{P}(W_1+W_2+\cdots+W_n\leq t_2-t_1)\\
&=1-\sum_{k=0}^{n-1}\frac{1}{k!}e^{-\lambda(t_2-t_1)}(\lambda(t_2-t_1))^k
\end{aligned}
\begin{aligned}
&\mathbb{P}(W_1+W_2+\cdots+W_n+W_{n+1}\leq t_2-t_1)\\
&=1-\sum_{k=0}^{n}\frac{1}{k!}e^{-\lambda(t_2-t_1)}(\lambda(t_2-t_1))^k
\end{aligned}
\begin{aligned}
\mathbb{P}(N=n)
&= \mathbb{P}(W_1+\cdots+W_n\leq t_2-t_1) - \mathbb{P}(W_1+\cdots+W_{n+1}\leq t_2-t_1) \\
&=\frac{1}{n!}e^{-\lambda(t_2-t_1)}(\lambda(t_2-t_1))^n
\end{aligned}
In particular, N ( t ) = N ( t ) − N ( 0 ) ∼ P o i ( λ t ) N(t)=N(t)-N(0)\sim Poi(\lambda t) N ( t ) = N ( t ) − N ( 0 ) ∼ P o i ( λ t )
\mathbb{E}(N(1))=\lambda \quad \leftarrow \text{intensity: expected number of arrivals in one unit of time}
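The argument above can be verified numerically: the Erlang cdf difference should reproduce the Poisson pmf term by term. A small Python sketch (the parameter values are arbitrary):

```python
import math

def poisson_pmf(n, mean):
    return math.exp(-mean) * mean ** n / math.factorial(n)

def erlang_cdf(x, n, lam):
    """P(W_1 + ... + W_n <= x) for i.i.d. Exp(lam) summands."""
    return 1 - sum(math.exp(-lam * x) * (lam * x) ** k / math.factorial(k)
                   for k in range(n))

# made-up values: lam = 2 and increment (t1, t2) = (1, 3.5), so lam*(t2-t1) = 5
lam, t1, t2, n = 2.0, 1.0, 3.5, 4
p_via_erlang = erlang_cdf(t2 - t1, n, lam) - erlang_cdf(t2 - t1, n + 1, lam)
```

The two cdf sums differ only in the $k=n$ term, which is exactly the $Poi(\lambda(t_2-t_1))$ probability of $n$.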
5.3.1.3. Combining and Thinning of Poisson Processes
Theorem :
\{N_1(t)\}\sim Poi(\lambda_1 t)
\{N_2(t)\}\sim Poi(\lambda_2 t)
{ N 1 ( t ) } \{N_1(t)\} { N 1 ( t ) } and { N 2 ( t ) } \{N_2(t)\} { N 2 ( t ) } are independent
Let N ( t ) = N 1 ( t ) + N 2 ( t ) N(t)=N_1(t)+N_2(t) N ( t ) = N 1 ( t ) + N 2 ( t ) , then { N ( t ) } ∼ P o i ( ( λ 1 + λ 2 ) t ) \{N(t)\}\sim Poi((\lambda_1+\lambda_2)t) { N ( t ) } ∼ P o i ( ( λ 1 + λ 2 ) t )
The combined Poisson Process is still a Poisson Process, with intensity being the sum of intensities.
Reason : Memoryless property, and
\begin{aligned}
&W_1\sim Exp(\lambda_1),\quad W_2\sim Exp(\lambda_2),\quad W_1\perp\!\!\!\perp W_2 \\
&\Rightarrow \min(W_1, W_2)\sim Exp(\lambda_1+\lambda_2)
\end{aligned}
$\Rightarrow$ the combined process is the counting process of events with interarrival times following $Exp(\lambda_1+\lambda_2)$
Thinning
Let { N ( t ) } ∼ P o i ( λ t ) \{N(t)\}\sim Poi(\lambda t) { N ( t ) } ∼ P o i ( λ t ) . Each arrival (customer) is labeled as type 1 or type 2, with probability p p p and 1 − p 1-p 1 − p , independently from others.
Let $N_1(t)$ and $N_2(t)$ be the number of customers of type 1 and type 2, respectively, who arrived before time $t$. Then
\begin{aligned}
\{N_1(t)\}&\sim Poi(p\lambda t)\\
\{N_2(t)\}&\sim Poi((1-p)\lambda t)\\
\text{and } \{N_1(t)\}& \perp\!\!\!\perp \{N_2(t)\}
\end{aligned}
Reason : This is the inverse procedure of combining two independent Poisson processes into one Poisson process
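A quick simulation of thinning (with made-up values $\lambda=4$, $p=1/4$, $t=1$): the type-1 and type-2 counts over $[0,t]$ should average $p\lambda t$ and $(1-p)\lambda t$ respectively.

```python
import random

rng = random.Random(7)
lam, p, t, trials = 4.0, 0.25, 1.0, 20000
n1_total = n2_total = 0
for _ in range(trials):
    s = rng.expovariate(lam)   # first arrival of the combined process
    n1 = n2 = 0
    while s <= t:
        if rng.random() < p:   # label each arrival type 1 with probability p
            n1 += 1
        else:
            n2 += 1
        s += rng.expovariate(lam)
    n1_total += n1
    n2_total += n2

mean1 = n1_total / trials  # should be near p * lam * t = 1.0
mean2 = n2_total / trials  # should be near (1 - p) * lam * t = 3.0
```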
5.3.1.4 Order Statistics Property
Let X 1 , … , X n X_1,\ldots,X_n X 1 , … , X n be i.i.d. r.v's. The order statistics of X 1 , … , X n X_1, \ldots, X_n X 1 , … , X n are random variables defined as follows.
\begin{aligned}
&X_{(1)} = \min\{X_1,\ldots,X_n\} \\
&X_{(2)} = \text{2nd smallest among } X_1,\ldots,X_n \\
&\quad\quad\quad\quad\quad\vdots \\
&X_{(n)} = \max\{X_1,\ldots,X_n\}
\end{aligned}
In other words, $X_{(1)},\ldots,X_{(n)}$ are such that $\{X_{(1)},\ldots,X_{(n)}\}=\{X_1,\ldots,X_n\}$ and $X_{(1)}\leq X_{(2)}\leq\cdots\leq X_{(n)}$
Now let $\{N(t)\}\sim Poi(\lambda t)$. Conditional on $N(t)=n$, the points/arrivals of $N$ in $[0,t]$ are distributed as the order statistics of $n$ i.i.d. uniform r.v's on $[0,t]$
That is
(S_1,\ldots, S_n | N(t)=n)\stackrel{d}{=} (U_{(1)},\ldots, U_{(n)})
where, U ( 1 ) , … , U ( n ) U_{(1)},\ldots,U_{(n)} U ( 1 ) , … , U ( n ) are the order statistics of U 1 , … , U n ∼ i i d U n i f [ 0 , t ] U_1, \ldots, U_n\stackrel{iid}{\sim}Unif[0,t] U 1 , … , U n ∼ i i d U n i f [ 0 , t ]
Reason :
\begin{aligned}
f_{S_1|\{N(t)=1\}}(s)
&=\frac{f_{S_1}(s)\mathbb{P}(W_2>t-s)}{\mathbb{P}(N(t)=1)} \\
&\propto f_{S_1}(s)\mathbb{P}(\underbrace{W_2}_{Exp(\lambda)}>t-s) \\
&= \lambda e^{-\lambda s} e^{-\lambda(t-s)} \\
&= \underbrace{\lambda e^{-\lambda t}}_{\text{const w.r.t. } s}
\end{aligned}
\Rightarrow S_1|\{N(t)=1\}\sim Unif[0,t]
As a result of the order statistics property, we have the following proposition:
N(s)|\{N(t)=n\}\sim Bin(n,\frac{s}{t})\quad\quad\text{for } s\leq t
Reason : Given N ( t ) = n N(t)=n N ( t ) = n , then
\begin{aligned}
N(s)
&= \#\{S_i : S_i\leq s, i=1,2,\ldots,n\} \\
\text{since $\{U_{(i)}\}$ is a }&= \#\{U_{(i)} : U_{(i)}\leq s, i=1,2,\ldots,n\} \\
\text{permutation of $\{U_i\}$ }&= \#\{U_i : U_i\leq s, i=1,2,\ldots,n\}
\end{aligned}
U_i\stackrel{iid}{\sim}Unif[0,t]
\mathbb{P}(U_i\leq s) = \frac{s}{t}\quad\quad i=1,\ldots,n
\Rightarrow N(s)|\{N(t)=n\}\sim Bin(n,\frac{s}{t})
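This proposition is easy to check by simulation: conditional on $N(t)=n$, we can sample the arrivals directly as $n$ i.i.d. uniforms (the values $n=10$, $s=2$, $t=5$ are made up):

```python
import random

rng = random.Random(1)
n, s, t, trials = 10, 2.0, 5.0, 50000
# Given N(t) = n, arrivals behave like n i.i.d. Unif[0, t] points,
# so N(s) = #{U_i <= s} should be Bin(n, s / t) with mean n * s / t = 4.
mean_Ns = sum(
    sum(rng.uniform(0, t) <= s for _ in range(n)) for _ in range(trials)
) / trials
```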
6. Continuous-Time Markov Chain
6.1. Definitions and Structures
Definition 6.1.1. Continuous-time Markov Chain (CTMC)
A continuous-time stochastic process { X ( t ) } t ≥ 0 \{X(t)\}_{t\geq 0} { X ( t ) } t ≥ 0 is called a continuous-time Markov Chain (CTMC), if its state space is at most countable, and it satisfies the continuous-time Markov property:
\begin{aligned}
&\mathbb{P}(X(t_m)=j|X(t_{m-1})=i, X(t_{m-2})=i_{m-2},\ldots, X(t_1)=i_1) \\
=& \mathbb{P}(X(t_m)=j|X(t_{m-1})=i)\\
&\quad\quad \text{for any }m, t_1<t_2<\cdots<t_m, i_1,\ldots, i_{m-2}, i, j\in S
\end{aligned}
As with DTMCs, typically $S=\{0,\ldots,m\}$ or $\{1,\ldots,m\}$ or $\{0,\pm1,\pm2,\ldots\}$
Time is continuous, but the state space is discrete ⇒ \Rightarrow ⇒ Process will "jump" between states
It can be regarded as a random step function.
Therefore, we need to specify two things:
When do the jumps happen? $\Leftrightarrow$ How long does the process stay in a state?
Given the process is in state i i i , it will stay in this state for an exponential random time, with parameter denoted as λ i \lambda_i λ i
Reason :
Markov property ⇒ \Rightarrow ⇒ when the process will jump in the future only depends on its current state, not on how long it has been in the current state ⇒ \Rightarrow ⇒ memoryless property ⇒ \Rightarrow ⇒ exponential.
Markov Property ⇒ \Rightarrow ⇒ the parameter of the exponential can only depend on the current state i i i .
When it jumps, where does it jump to?
The continuous-time Markov Chain will jump according to a transition probability q i j q_{ij} q i j , which only depends on i i i and j j j
Reason :
Markov property ⇒ \Rightarrow ⇒ given i , q i j ⎵ f u t u r e i, \underbrace{q_{ij}}_{future} i , f u t u r e q i j can not depend on anything else.
q_{ij}=\mathbb{P}(X(t) \text{ jumps to } j|X(t) \text{ jumps from } i) \Rightarrow \begin{cases}
q_{ii}=0,\quad q_{ij}\geq 0 \text{ for } j\cancel{=}i \\
\sum_{j\in S}q_{ij}=\sum_{j\in S, j\cancel{=}i}q_{ij}=1
\end{cases}
Define $Q=\{q_{ij}\}_{i,j\in S}$:
Q=\begin{bmatrix}
0&q_{01}&q_{02}&\cdots \\
q_{10}&0&q_{12}&\cdots \\
\vdots&\vdots&\vdots&\ddots
\end{bmatrix}
the row sums of Q Q Q are 1
A CTMC is fully characterized by { λ i } i ∈ S \{\lambda_i\}_{i\in S} { λ i } i ∈ S and Q = { q i j } i , j ∈ S Q=\{q_{ij}\}_{i,j\in S} Q = { q i j } i , j ∈ S
To conclude, a CTMC stays in a state i i i for an exponential random time T i T_i T i ; then jumps to another state j j j with probability q i j q_{ij} q i j , then stays in j j j for an exponential random time T j T_j T j , ⋯ \cdots ⋯ , all the jumps and times spent in different states are independent.
Example 6.1.1.1.
We have seen that the Poisson process satisfies the continuous-time Markov property $\Rightarrow$ it is a CTMC
\lambda_i=\lambda \quad\quad i\in S=\{0,1,2,\cdots\}
(interarrival times do not depend on current state. ⇒ \Rightarrow ⇒ time spent in different states are i.i.d.)
q_{i,i+1} = 1,\quad q_{ij} = 0 \text{ otherwise } (j\cancel{=}i+1)
6.2. Generator Matrix
Similar to the discrete-time case, we have the transition probability at time t t t .
\begin{aligned}
P_{ij}(t)
&= \mathbb{P}(X(t)=j|X(0)=i) \\
&= \mathbb{P}(X(t+s)=j|X(s)=i) \quad\quad\text{assuming the MC is time-homogeneous}
\end{aligned}
and matrix
P(t)=\{P_{ij}(t)\}_{i,j\in S}
P(t)=\begin{pmatrix}
P_{00}(t) & P_{01}(t) & \cdots \\
P_{10}(t) & P_{11}(t) & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}
The C-K equation still holds
P(t+s) = P(t)\cdot P(s)
Proof : ∀ i , j ∈ S \forall i,j\in S ∀ i , j ∈ S
\begin{aligned}
P_{ij}(t+s)
&= \mathbb{P}(X(t+s)=j|X(0)=i) \\
&= \sum_{k\in S}\mathbb{P}(X(t+s)=j|X(t)=k,\cancel{X(0)=i})\cdot \mathbb{P}(X(t)=k|X(0)=i) \\
&= \sum_{k\in S}\mathbb{P}(X(s)=j|X(0)=k)\cdot\mathbb{P}(X(t)=k|X(0)=i) \\
&= \sum_{k\in S}P_{kj}(s)\cdot P_{ik}(t)\\
&=(P(t)\cdot P(s))_{ij}\quad\quad\blacksquare
\end{aligned}
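The C-K equation can be checked numerically on a two-state CTMC, for which $P(t)$ has a standard closed form (stated here without derivation; the rates $a$, $b$ below are made-up values):

```python
import math

def two_state_P(t, a, b):
    """P(t) for the CTMC on {0, 1} with rate a for 0 -> 1 and rate b for
    1 -> 0 (standard closed form, not derived in these notes)."""
    s, e = a + b, math.exp(-(a + b) * t)
    return [[b / s + a / s * e, a / s * (1 - e)],
            [b / s * (1 - e), a / s + b / s * e]]

def matmul(A, B):
    """2x2 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

a, b = 2.0, 3.0
lhs = two_state_P(0.7 + 1.1, a, b)                            # P(t + s)
rhs = matmul(two_state_P(0.7, a, b), two_state_P(1.1, a, b))  # P(t) P(s)
```

Up to floating-point rounding, `lhs` and `rhs` agree entrywise, as C-K requires.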
Note that we have
P(0) = I
(P_{ii}(0)=\mathbb{P}(X(0)=i|X(0)=i) = 1,\quad P_{ij}(0)=0\text{ for } j\cancel{=}i)
and
\lim_{t\rightarrow0^+} P(t)=I
Actually, we have the following stronger result:
R:= \lim_{h\rightarrow 0^+}\frac{P(h)-P(0)}{h}=\lim_{h\rightarrow0^+}\frac{P(h)-I}{h}
exists, and is called the (infinitesimal) generator matrix of { X ( t ) } t ≥ 0 \{X(t)\}_{t\geq 0} { X ( t ) } t ≥ 0
Entry-wise:
R_{ij}=\lim_{h\rightarrow0^+}\frac{P_{ij}(h)-P_{ij}(0)}{h} = \begin{cases}
\lim_{h\rightarrow0^+}\frac{P_{ii}(h)-1}{h}\leq 0 & j = i\\
\lim_{h\rightarrow0^+}\frac{P_{ij}(h)}{h}\geq 0 & j \cancel{=}i
\end{cases}
Relation between R R R and { λ i } i ∈ S \{\lambda_i\}_{i\in S} { λ i } i ∈ S and Q = { q i j } i , j ∈ S Q=\{q_{ij}\}_{i,j\in S} Q = { q i j } i , j ∈ S
R_{ii}=-\lambda_i\quad,\quad R_{ij}=\lambda_iq_{ij}\quad\quad j\cancel{=}i
Reason :
R_{ii}=\lim_{h\rightarrow0^+}\frac{P_{ii}(h)-1}{h} = \lim_{h\rightarrow0^+}\frac{\mathbb{P}(T_i>h) - 1}{h}
Where T i T_i T i is the random time the process stays in i i i . The equality holds because when h h h is very small, the probability of having two or more jumps in time h h h is negligible.
\mathbb{P}(X(h)=i|X(0)=i) = \mathbb{P}(T_i>h) +o(h)\quad\leftarrow\text{the $o(h)$ term: having at least 2 jumps and coming back to $i$}
{}^*\; a(h)=o(h) \text{ if } \lim_{h\rightarrow 0}\frac{a(h)}{h}=0
\begin{aligned}
R_{ii}
&=\lim_{h\rightarrow 0^+}\frac{\mathbb{P}(\overbrace{T_i}^{Exp(\lambda_i)}>h)-1}{h}\\
&=\lim_{h\rightarrow 0^+}\frac{e^{-\lambda_i h}-1}{h}\\
&=\lim_{h\rightarrow 0^+}\frac{e^{-\lambda_i h}-e^{-\lambda_i \cdot 0}}{h}\\
&= \frac{de^{-\lambda_ih}}{dh}\Big|_{h=0}\\
&= -\lambda_i e^{-\lambda_ih}\big|_{h=0}=-\lambda_i
\end{aligned}\\
\begin{aligned}
R_{ij}=\lim_{h\rightarrow 0^+}\frac{P_{ij}(h)}{h}
&=\lim_{h\rightarrow0^+}\frac{\mathbb{P}(X(h)=j|X(0)=i)}{h}\\
\text{only one jump happens; two or}
&=\lim_{h\rightarrow0^+}\frac{\mathbb{P}(T_i<h,X(T_i)=j)}{h} \\
\text{more jumps are negligible}
&=q_{ij}\lim_{h\rightarrow0^+}\frac{\mathbb{P}(T_i<h)}{h} \\
&=q_{ij}\lim_{h\rightarrow0^+}\frac{1-e^{-\lambda_ih}}{h} \\
&=q_{ij}\lim_{h\rightarrow0^+}\frac{e^{-\lambda_i\cdot 0}-e^{-\lambda_ih}}{h} \\
&= q_{ij}\Big(-\frac{de^{-\lambda_ih}}{dh}\Big|_{h=0}\Big) \\
&= q_{ij}\lambda_i
\end{aligned}
Thus, we conclude that
R_{ii}=-\lambda_i,\quad R_{ij}=\lambda_iq_{ij}
Note that R i i ≤ 0 R_{ii}\leq 0 R i i ≤ 0 , R i j ≥ 0 , R_{ij}\geq 0, R i j ≥ 0 , j = i j\cancel{=}i j = i
\begin{aligned}
\sum_{j\in S}R_{ij}
&= R_{ii}+\sum_{j\in S, j\cancel{=}i}R_{ij} \\
&= -\lambda_i+\sum_{j\in S, j\cancel{=}i}\lambda_iq_{ij} \quad \leftarrow \sum_{j\in S, j\cancel{=}i}q_{ij}=1 \\
&= -\lambda_i+\lambda_i \\
&= 0
\end{aligned}
The row sums of $R$ are 0:
$$R=\begin{pmatrix}
-\lambda_0 & \lambda_0q_{01} & \lambda_0q_{02} & \cdots \\
\lambda_1q_{10} & -\lambda_1 & \lambda_1q_{12} & \cdots \\
\lambda_2q_{20} & \lambda_2q_{21} & -\lambda_2 & \cdots \\
\vdots &\vdots&\vdots&\ddots
\end{pmatrix}$$
$-\lambda_i$ : probability flow / rate going out of state $i$
$\lambda_iq_{ij}$ : probability flow / rate going from $i$ to $j$
From $R$ to $\{\lambda_i\}$ and $Q$:
$\quad$ If we know $R$, then
$$\lambda_i=-R_{ii},\qquad q_{ij} = \frac{-R_{ij}}{R_{ii}}\quad (j\neq i)$$
Thus, there is a $1-1$ relation between $\{\lambda_i\}_{i\in S} + \{q_{ij}\}_{i,j\in S}$ and $R=\{R_{ij}\}_{i,j\in S}$
$\Rightarrow$ the generator $R$ itself also fully characterizes the transitional behaviour of the CTMC.
Conclusion: $\{\lambda_i\}_{i\in S}+\{q_{ij}\}_{i,j\in S}$ and $\{R_{ij}\}_{i,j\in S}$ are two sets of parameters that can be used to specify a CTMC.
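This $1-1$ relation is easy to check numerically. The following is a minimal sketch (not course code), using a hypothetical 3-state chain, showing that building $R$ from $\{\lambda_i\}$ and $Q$ and then inverting the map recovers the original parameters, and that the row sums of $R$ are 0.

```python
import numpy as np

# Sketch of the 1-1 relation between ({lambda_i}, Q) and the generator R.

def build_generator(lam, Q):
    """R_ii = -lambda_i, R_ij = lambda_i * q_ij for j != i."""
    R = np.diag(lam) @ Q          # off-diagonal entries lambda_i * q_ij
    np.fill_diagonal(R, -np.asarray(lam, dtype=float))
    return R

def recover_parameters(R):
    """lambda_i = -R_ii, q_ij = -R_ij / R_ii for j != i."""
    lam = -np.diag(R)
    Q = R / lam[:, None]          # divide row i by lambda_i
    np.fill_diagonal(Q, 0.0)
    return lam, Q

# Hypothetical 3-state example: holding rates and jump-chain matrix
lam = np.array([1.0, 2.0, 3.0])
Q = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.3, 0.7, 0.0]])
R = build_generator(lam, Q)
assert np.allclose(R.sum(axis=1), 0.0)        # row sums of R are 0
lam2, Q2 = recover_parameters(R)
assert np.allclose(lam, lam2) and np.allclose(Q, Q2)
```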
Example 6.2.1. Poisson Process
$$\lambda_i=\lambda,\quad
Q=\begin{pmatrix}
0 & 1 \\
& 0 & 1 \\
& & 0 & 1 \\
& & & \ddots &\ddots \\
\end{pmatrix}$$
$$\Rightarrow
\begin{aligned}
&R_{ii}=-\lambda_i=-\lambda \quad\quad i=0,1,\ldots\\
&R_{ij}=\lambda_iq_{ij}=\begin{cases}
\lambda & j=i+1 \\
0 & j\neq i,\, i+1
\end{cases}
\end{aligned}\\
R=\begin{pmatrix}
-\lambda & \lambda \\
&-\lambda & \lambda \\
& & \ddots &\ddots \\
\end{pmatrix}$$
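As a sanity check (a sketch, not part of the notes), we can verify numerically that $P(t)=e^{tR}$ for this generator reproduces the Poisson probabilities $P_{0j}(t)=e^{-\lambda t}(\lambda t)^j/j!$. The infinite state space is truncated at $N$ states, which is accurate for $j\ll N$, and the matrix exponential is approximated by a truncated Taylor series (fine for this small, well-behaved matrix).

```python
import math
import numpy as np

# Truncated Poisson-process generator: R[i,i] = -lam, R[i,i+1] = lam
lam, t, N = 2.0, 0.5, 40
R = np.zeros((N, N))
for i in range(N - 1):
    R[i, i], R[i, i + 1] = -lam, lam

def expm_taylor(A, terms=60):
    """Truncated Taylor series for the matrix exponential (fine for small ||A||)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out += term
    return out

P = expm_taylor(t * R)
for j in range(5):
    poisson_pj = math.exp(-lam * t) * (lam * t) ** j / math.factorial(j)
    assert abs(P[0, j] - poisson_pj) < 1e-8
```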
Example 6.2.2. 3-tables in a restaurant
Parties of customers arrive according to a Poisson process with intensity $\lambda$.
If there are free tables $\rightarrow$ the party is served, and spends an exponential amount of time with average $\frac{1}{\mu}$ (parameter $\mu$)
If there is no free table $\rightarrow$ the party leaves immediately
Let $X(t)$ be the number of occupied tables at time $t$ $\Rightarrow$ $S=\{0,1,2,3\}$
Since all the interarrival times and service times are exponential and independent, the process $\{X(t)\}$ is a CTMC.
Find $\lambda_i$ and $q_{ij}$:
For $i=0$:
$q_{01}=1\quad\quad$ no customer $\rightarrow$ one table occupied
$\lambda_0=\lambda\quad\quad$ leave state 0 $\Leftrightarrow$ one party arrives
$q_{02}=q_{03}=0$
For $i=1$:
Potential change of states: $\begin{cases}1\rightarrow 2\quad\text{if one party arrives first}\quad\quad\sim Exp(\lambda)\sim T\\1\rightarrow 0\quad\text{if a service is completed first}\sim Exp(\mu)\sim S \end{cases}$
Which one actually happens depends on which time is smaller.
Recall a property of the exponential distribution:
$$\min(\underbrace{T}_{Exp(\lambda)}, \underbrace{S}_{Exp(\mu)})\sim Exp(\lambda +\mu) \leftarrow \text{the distribution of the time spent in the current state}$$
$$\begin{aligned}
q_{12}=\mathbb{P}(T<S) &= \frac{\lambda}{\lambda+\mu} \\
q_{10}=1-q_{12}&=\frac{\mu}{\lambda+\mu} \\
q_{13} &= 0
\end{aligned}$$
Thus,
$\lambda_1=\lambda+\mu$
$q_{12}=\mathbb{P}(T<S)=\frac{\lambda}{\lambda+\mu}$
$q_{10}=1-q_{12}=\frac{\mu}{\lambda+\mu}$
$q_{13}=0$
Similarly, when 2 tables are occupied:
$T$: time until next arrival $\sim Exp(\lambda)$
$S_1$: service time for table 1 $\sim Exp(\mu)$
$S_2$: service time for table 2 $\sim Exp(\mu)$
$\Rightarrow \min(T, S_1, S_2)\sim Exp(\lambda + 2\mu)$
$$\mathbb{P}(\underbrace{T<S_1,S_2}_{2\rightarrow 3}) = \frac{\lambda}{\lambda+2\mu}$$
Thus,
$\lambda_2 = \lambda + 2\mu$
$q_{23} = \frac{\lambda}{\lambda+2\mu},\quad q_{21}=1-q_{23}=\frac{2\mu}{\lambda+2\mu}$
$q_{20} = 0$
Finally,
$\lambda_3=3\mu$ $\rightarrow$ one service is completed
$q_{32}=1$
$q_{31}=q_{30}=0$
$$Q=\begin{pmatrix}
0 & 1 & 0 & 0 \\
\frac{\mu}{\lambda+\mu} & 0 & \frac{\lambda}{\lambda+\mu} & 0 \\
0 & \frac{2\mu}{\lambda+2\mu} & 0 & \frac{\lambda}{\lambda+2\mu} \\
0 & 0 & 1 & 0 \\
\end{pmatrix}$$
$\lambda_0=\lambda$
$\lambda_1=\lambda+\mu$
$\lambda_2=\lambda+2\mu$
$\lambda_3=3\mu$
$$R_{ii}=-\lambda_i, \quad R_{ij}=\lambda_i q_{ij}$$
$$R=\begin{pmatrix}
-\lambda & \lambda & 0 & 0\\
\mu & -(\lambda +\mu) & \lambda & 0 \\
0 & 2\mu & -(\lambda +2\mu) & \lambda \\
0 & 0 & 3\mu & -3\mu
\end{pmatrix}$$
We see that $R$ has a simpler form than $Q$. In fact, $R_{ij}$ ($i\neq j$) directly corresponds to the "rate" at which the process moves from state $i$ to $j$. Thus, in practice, we often directly model $R$ rather than first finding $\{\lambda_i\}_{i\in S}$ and $Q=\{q_{ij}\}_{i,j\in S}$.
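The jump mechanism of this example can also be simulated directly. Below is a minimal simulation sketch (with assumed values $\lambda=1$, $\mu=1.5$, not from the notes): from each state we realize the competing exponential clocks and record which jump wins, so the empirical frequency of the jump $1\rightarrow 2$ should approach $q_{12}=\frac{\lambda}{\lambda+\mu}$.

```python
import random

lam, mu = 1.0, 1.5        # assumed arrival and service rates
random.seed(0)

def step(state):
    """One transition of the 3-table CTMC: returns (holding_time, next_state)."""
    if state == 0:
        return random.expovariate(lam), 1
    if state == 3:
        return random.expovariate(3 * mu), 2
    # states 1, 2: race between an arrival and `state` independent services
    arrival = random.expovariate(lam)
    service = min(random.expovariate(mu) for _ in range(state))
    if arrival < service:
        return arrival, state + 1
    return service, state - 1

n, up = 100_000, 0
for _ in range(n):
    _, nxt = step(1)
    up += (nxt == 2)
print(up / n, lam / (lam + mu))   # empirical vs theoretical q_12
```

With $10^5$ samples the empirical frequency agrees with $q_{12}=0.4$ to within Monte Carlo error.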
6.3. Classification of States
The matrix $Q$ is a transition matrix of a DTMC ($q_{ij}\geq 0$, $\sum_{j\in S}q_{ij}=1$); the DTMC it defines is called the discrete skeleton of the CTMC. It contains all the information about the state changes, but "forgets" the time.
Since accessibility, communication, irreducibility, and recurrence/transience are only related to the change of states, not the time, these properties are the same for the CTMC and its discrete skeleton.
For a CTMC $\{X(t)\}_{t\geq 0}$, we say "state $j$ is accessible from state $i$", "$i$ and $j$ communicate", "the process is irreducible", "$i$ is recurrent/transient" if and only if this is the case for its discrete skeleton.
6.3.1. Positive / Null Recurrence
Note that since positive/null recurrence do involve the (expected) amount of time, we can indeed have different results for a CTMC and its discrete skeleton.
Let $R_i$ be the amount of (continuous, random) time until the MC revisits state $i$.
A state $i$ is called positive recurrent if it is recurrent and $\mathbb{E}(R_i|X(0)=i)<\infty$. It is called null recurrent if it is recurrent and $\mathbb{E}(R_i|X(0)=i)=\infty$.
As in the discrete-time case, positive recurrence, null recurrence and transience are class properties.
6.4. Stationary Distribution
Definition 6.4.1. Stationary Distribution
A distribution $\underline{\pi}=(\pi_0,\pi_1,\cdots)$ is called a stationary distribution of a CTMC $\{X(t)\}_{t\geq 0}$ with generator $R$ if it satisfies:
$\underline{\pi}\cdot R=\underline{0}=(0,0,\cdots)$
$\sum_{i\in S}\pi_i=1\quad$ (i.e. $\underline{\pi}\cdot\underline{1}=1$)
Q: Why is such a $\underline{\pi}$ called stationary?
A: Assume the process starts from the initial distribution $\underline{\alpha}^{(0)}=\underline{\pi}$:
$$\mathbb{P}(X(0)=i)=\pi_i$$
Then the distribution at time $t$ is given by
$$\underline{\alpha}^{(t)}=\underline{\alpha}^{(0)}\cdot P(t)=\underline{\pi}\cdot P(t)$$
Reason:
$$\begin{aligned}
\alpha_j^{(t)}
&= \mathbb{P}(X(t)=j)\\
&= \sum_{i\in S}\mathbb{P}(X(t)=j|X(0)=i)\mathbb{P}(X(0)=i)\\
&= \sum_{i\in S}P_{ij}(t)\cdot \alpha_i^{(0)}\\
&=(\underline{\alpha}^{(0)}\cdot P(t))_j
\end{aligned}$$
$$\begin{aligned}
\Rightarrow \frac{d}{dt}\underline{\alpha}^{(t)}
&= \frac{d}{dt}(\underline{\pi}\cdot P(t)) \\
&= \underline{\pi}\left(\frac{d}{dt} P(t)\right) \\
&= \underline{\pi}\lim_{h\rightarrow 0^+}\frac{P(t+h)-P(t)}{h} \\
&= \underline{\pi}\lim_{h\rightarrow 0^+}\frac{P(h)P(t)-P(t)}{h} \quad \leftarrow\text{C-K equation} \\
&= \underline{\pi}\left(\lim_{h\rightarrow 0^+}\frac{P(h) - I}{h}\right)P(t) \\
&= \underline{\pi}\cdot R\cdot P(t) \\
&= \underline{0}\cdot P(t) = \underline{0}
\end{aligned}$$
$\Rightarrow \underline{\alpha}^{(t)}$ is a constant (vector).
In other words, the distribution of $X(t)$ will not change over time, if the MC starts from the stationary distribution.
Fact 6.4.1. Stationary Distribution
If a CTMC starts from a stationary distribution, then its distribution will never change.
In the above derivation, we also see that
$$\frac{d}{dt}P(t) = P'(t)=R\cdot P(t)$$
This is called Kolmogorov's Backward Equation.
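For a 2-state chain, where $P(t)$ has a closed form, the backward equation can be checked by finite differences. A sketch with assumed rates $a$ (for $0\rightarrow 1$) and $b$ (for $1\rightarrow 0$), using the standard closed form $P(t)=\Pi+e^{-(a+b)t}(I-\Pi)$ where each row of $\Pi$ is the stationary distribution:

```python
import numpy as np

a, b = 1.0, 2.0                   # assumed rates 0->1 and 1->0
R = np.array([[-a, a], [b, -b]])  # generator

def P(t):
    """Closed-form transition matrix for the 2-state chain."""
    Pi = np.tile(np.array([b, a]) / (a + b), (2, 1))   # rows = stationary dist.
    return Pi + np.exp(-(a + b) * t) * (np.eye(2) - Pi)

t, h = 0.7, 1e-6
lhs = (P(t + h) - P(t - h)) / (2 * h)        # numerical P'(t), central difference
rhs = R @ P(t)                               # backward equation right-hand side
assert np.allclose(lhs, rhs, atol=1e-5)
```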
6.4.1. Basic Limit Theorem for CTMC
Let $\{X(t)\}_{t\geq 0}$ be an irreducible, recurrent CTMC. Then
$$\lim_{t\rightarrow\infty}P_{ij}(t)=:\pi_j'=\frac{\mathbb{E}(T_j)}{\mathbb{E}(R_j|X(0)=j)} = \frac{1/\lambda_j}{\mathbb{E}(R_j|X(0)=j)}$$
In addition, the MC is positive recurrent if and only if a unique stationary distribution exists. In this case, the stationary distribution is $\underline{\pi}'=(\pi_0',\pi_1',\cdots)$
$$\frac{\mathbb{E}(T_j)}{\mathbb{E}(R_j|X(0)=j)} = \text{long-run fraction of time spent in state $j$}$$
Thus, $\pi_j'$ is also the long-run fraction of time that the process spends in state $j$.
6.5. Birth and Death Processes
Definition 6.5.1. Birth and Death Process
A Birth and Death Process is a CTMC such that $S=\{0,1,\cdots, M\}$ or $S=\{0,1,\cdots\}$, and $\underline{q_{ij}=0}$ if $\underline{|j-i|>1}$.
The process can only change to neighbouring states:
$$q_{i,i-1}+q_{i,i+1} = 1\quad (i\geq 1),\qquad q_{01} = 1$$
For a birth and death process, we use a new set of parameters:
Denote
$$\begin{aligned}
&\lambda_i = R_{i,i+1}\quad\quad i=0,1,\cdots\\
&\mu_i = R_{i,i-1}\quad\quad i=1,2,\cdots
\end{aligned}\\
R=\begin{pmatrix}
-\lambda_0 & \lambda_0 \\
\mu_1&-(\lambda_1+\mu_1)&\lambda_1 \\
& \mu_2&-(\lambda_2+\mu_2)&\lambda_2 \\
&&\mu_3&-(\lambda_3+\mu_3)&\lambda_3 \\
&&&\ddots&\ddots&\ddots
\end{pmatrix}\\
\Rightarrow
\begin{aligned}
&R_{ii} =-(\lambda_i+\mu_i) \quad i\geq 1 \\
&R_{00} =-\lambda_0
\end{aligned}$$
$\lambda_i$'s are called the "birth rates" (population $+1$)
$\mu_i$'s are called the "death rates" (population $-1$)
Example 6.5.1. Previous Example of Restaurant
$$R=\begin{pmatrix}
-\lambda &\lambda \\
\mu & -(\mu+\lambda) & \lambda\\
& 2\mu & -(2\mu+\lambda) & \lambda\\
&&3\mu & -3\mu
\end{pmatrix}$$
This is a birth and death process.
Birth rates: $\lambda_i = \lambda\quad i=0,1,2$
Death rates: $\mu_i = i\cdot\mu\quad i=1,2,3$
In general, consider an M/M/S queueing system.
M: interarrival times are exponential
M: service times are exponential $Exp(\mu)$ (each server)
S: number of servers ($s$)
$X(t) =$ # of customers in the system at time $t$
Birth rate:
birth $\Leftrightarrow$ arrival
$\Rightarrow \lambda_i = \lambda$
Death rate:
death $\Leftrightarrow$ service done
When there are $i$ customers:
Case 1: $i \leq s$
$i$ servers are busy, each $\sim Exp(\mu)$
$\Rightarrow$ total death rate $\mu_i=i\cdot\mu$
Case 2: $i > s$
$s$ servers are busy, $\mu_i=s\cdot\mu$
Thus, for the M/M/S queue,
$$\lambda_i=\lambda,\quad
\mu_i=\begin{cases}
i\cdot \mu & i\leq s\\
s\cdot \mu & i > s
\end{cases}$$
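This rate pattern is easy to encode. A tiny sketch (a hypothetical helper, not from the notes): the death rate grows linearly until all $s$ servers are busy, then saturates at $s\mu$.

```python
def mms_rates(i, lam, mu, s):
    """Return (birth rate, death rate) in state i of an M/M/s queue."""
    birth = lam                 # arrivals are unaffected by the queue length
    death = min(i, s) * mu      # at most s servers can be working
    return birth, death

assert mms_rates(2, 1.0, 0.5, 3) == (1.0, 1.0)   # i <= s: death rate i*mu
assert mms_rates(5, 1.0, 0.5, 3) == (1.0, 1.5)   # i > s: saturates at s*mu
```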
Example 6.5.2. A Population Model
Each individual gives birth to an offspring with exponential rate $\lambda$ (i.e. the waiting time until its next offspring $\sim Exp(\lambda)$). Each individual dies with exponential rate $\mu$. Let $X(t)$ be the population size at time $t$.
When the population size is $i$, the time until the next (potential) birth is the minimum of the $i$ i.i.d. birth times $\sim Exp(i\cdot\lambda)$.
Thus the birth rate is $\lambda_i=i\cdot\lambda\quad i=0,1,2,\cdots$
Similarly, the death rate is $\mu_i=i\mu\quad i=1,2,\cdots$
$$R=\begin{pmatrix}
0 & 0 \\
\mu & -(\lambda+\mu) & \lambda \\
& 2\mu & -2(\lambda+\mu) & 2\lambda \\
&& \ddots&\ddots
\end{pmatrix}\\
*: \text{state 0 is absorbing}$$
If we do need to know $\{\lambda_i\}$ and $Q$:
$$\lambda_i = -R_{ii}=i(\lambda+\mu)\quad i\geq 0 \\
q_{i,i+1} = \frac{i\lambda}{i(\lambda+\mu)} = \frac{\lambda}{\lambda+\mu} \quad i\geq1 \quad (*) \\
q_{i,i-1}=\frac{i\mu}{i(\lambda+\mu)}=\frac{\mu}{\lambda+\mu} \quad i\geq 1$$
$(*)\ q_{0,1}=1$, arbitrary since $\lambda_0=0$.
$$Q=\begin{pmatrix}
0 & 1\\
\frac{\mu}{\lambda+\mu} & 0 & \frac{\lambda}{\lambda+\mu}\\
&\frac{\mu}{\lambda+\mu} & 0 & \frac{\lambda}{\lambda+\mu}\\
&& \ddots& \ddots
\end{pmatrix}$$
We can further add immigration to the system: individuals are added to the population according to a Poisson process with intensity $\alpha$.
This will not change the "rate" from $i$ to $i-1$.
The time until the next increase in population is now the minimum of two r.v.'s following $Exp(i\lambda)$ (births) and $Exp(\alpha)$ (immigration).
$\Rightarrow$ the rate from $i$ to $i+1$ becomes $i\lambda+\alpha$
$$R=\begin{pmatrix}
-\alpha & \alpha \\
\mu & -(\lambda+\mu+\alpha) & \lambda + \alpha \\
&2\mu & -(2\lambda+2\mu+\alpha) & 2\lambda + \alpha \\
&& \ddots& \ddots
\end{pmatrix}$$
$$\begin{aligned}
(*) \quad&\lambda_i = R_{i,i+1} = i\lambda +\alpha\quad\quad i=0,1,\cdots\\
&\mu_i = R_{i,i-1} = i\mu \quad\quad\quad\quad i = 1,2,\cdots
\end{aligned}$$
$(*)$ not the "biological" birth rate, but the total rate at which the process goes from $i$ to $i+1$
$\{\lambda_i\}$ and $Q$ will change accordingly.
Note that state 0 is no longer absorbing due to the immigration.
As we have seen, there are two main types of birth and death processes: queueing systems and population models. The key difference between them is that the birth rate in a queueing system is typically a constant (does not depend on the current state $i$), while the birth rate in a population model increases as $i$ increases.
6.5.1. Stationary Distribution of a Birth and Death Process
$$\begin{cases}
\underline{\pi}\cdot R = \underline{0}\quad &(1)\\
\underline{\pi}\cdot \underline{1} = 1 \quad &(2)
\end{cases}$$
$$(\pi_0,\pi_1,\cdots) \cdot \begin{pmatrix}
-\lambda_0 & \lambda_0 \\
\mu_1 & -(\lambda_1+\mu_1)&\lambda_1\\
&\mu_2 & -(\lambda_2+\mu_2)&\lambda_2\\
&& \ddots& \ddots& \ddots
\end{pmatrix}=\underline{0}$$
$$(1) \Rightarrow
\begin{aligned}
&-\lambda_0\pi_0+\mu_1\pi_1 = 0 \Rightarrow\pi_1=\frac{\lambda_0}{\mu_1}\pi_0\\
&\lambda_0\pi_0-(\lambda_1+\mu_1)\pi_1+\mu_2\pi_2 = 0
\end{aligned}$$
Adding this to the first equation, we have
$$-\lambda_1\pi_1+\mu_2\pi_2 = 0 \Rightarrow\pi_2=\frac{\lambda_1}{\mu_2}\pi_1$$
In general, adding the first i i i equations, we have
$$-\lambda_0\pi_0+\mu_1\pi_1= 0\\
\lambda_0\pi_0-(\lambda_1+\mu_1)\pi_1+\mu_2\pi_2=0\\
\vdots\\
\lambda_{i-2}\pi_{i-2}-(\lambda_{i-1}+\mu_{i-1})\pi_{i-1}+\mu_i\pi_i=0\\
\Rightarrow\ -\lambda_{i-1}\pi_{i-1}+\mu_i\pi_i=0$$
$$\begin{aligned}
\Rightarrow \pi_i &= \frac{\lambda_{i-1}}{\mu_i}\pi_{i-1}\\
&=\cdots\\
&=\frac{\lambda_0\lambda_1\cdots\lambda_{i-1}}{\mu_1\mu_2\cdots\mu_i}\pi_0
\end{aligned}$$
Use $(2)$ to normalize:
$$1=\sum_{n=0}^\infty \pi_n = \left(1+\sum_{n=1}^{\infty}\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}\right)\pi_0$$
$$\begin{aligned}
\Rightarrow
&\pi_0=\frac{1}{1+\sum_{n=1}^\infty\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}}\\
&\pi_i=\frac{\prod_{j=1}^i\frac{\lambda_{j-1}}{\mu_j}}{1+\sum_{n=1}^\infty\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}}
\end{aligned}$$
Thus, a stationary distribution exists (i.e. the MC is positive recurrent, assuming irreducibility) if and only if
$$\sum_{n=1}^\infty\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j} < \infty$$
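Applied to the finite 3-table restaurant example (with assumed values $\lambda=1$, $\mu=1.5$, where the sum is trivially finite), the product formula gives the stationary distribution directly, and we can verify $\underline{\pi}\cdot R=\underline{0}$. A sketch:

```python
import numpy as np

lam, mu = 1.0, 1.5                 # assumed arrival and service rates
birth = [lam, lam, lam]            # lambda_0, lambda_1, lambda_2
death = [mu, 2 * mu, 3 * mu]       # mu_1, mu_2, mu_3

# pi_i proportional to (lambda_0 ... lambda_{i-1}) / (mu_1 ... mu_i)
weights = [1.0]
for b, d in zip(birth, death):
    weights.append(weights[-1] * b / d)
pi = np.array(weights) / sum(weights)

# Verify pi . R = 0 with the generator derived earlier for this example
R = np.array([[-lam, lam, 0, 0],
              [mu, -(lam + mu), lam, 0],
              [0, 2 * mu, -(lam + 2 * mu), lam],
              [0, 0, 3 * mu, -3 * mu]])
assert np.allclose(pi @ R, 0.0)
print(pi)
```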
Example 6.5.1.1. M/M/S Queue (cont'd)
$$\lambda_i = \lambda,\quad\mu_i=\begin{cases}
i\mu & i\leq s \\
s\mu & i>s
\end{cases}$$
$$\begin{aligned}
&\sum_{n=1}^\infty \prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}\\
&= \underbrace{\frac{\lambda}{\mu}+\frac{\lambda}{\mu}\cdot\frac{\lambda}{2\mu}+\cdots+\frac{\lambda}{\mu}\frac{\lambda}{2\mu}\cdots\frac{\lambda}{s\mu} + \frac{\lambda}{\mu}\frac{\lambda}{2\mu}\cdots\left(\frac{\lambda}{s\mu}\right)^2+\frac{\lambda}{\mu}\frac{\lambda}{2\mu}\cdots\left(\frac{\lambda}{s\mu}\right)^3+\cdots}_{\text{eventually a geometric series with ratio $\frac{\lambda}{s\mu}$}}
\end{aligned}$$
$\Rightarrow$ The sum is finite if and only if $\lambda<s\mu$
$\Rightarrow$ the process $\{X(t)\}_{t\geq 0}$ is positive recurrent if and only if $\underbrace{\lambda}_{\text{arrival rate}} < \underbrace{s\mu}_{\text{maximal (total) service rate}}$
Example 6.5.1.2. Population Model (with immigration)
$$\lambda_i=i\lambda+\alpha,\quad\mu_i=i\mu$$
$$\sum_{n=1}^\infty \prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j} = \sum_{n=1}^\infty\prod_{j=1}^n\frac{(j-1)\lambda+\alpha}{j\mu}$$
$$\lim_{j\rightarrow\infty}\frac{(j-1)\lambda+\alpha}{j\mu}=\frac{\lambda}{\mu}$$
If $\lambda<\mu$, then $\sum_{n=1}^\infty\prod_{j=1}^n\frac{(j-1)\lambda+\alpha}{j\mu}<\infty$ by the ratio test.
If $\lambda>\mu$, then $\sum_{n=1}^\infty\prod_{j=1}^n\frac{(j-1)\lambda+\alpha}{j\mu}=\infty$.
If $\lambda=\mu$ and $\alpha \geq \lambda=\mu$, then the ratio $\frac{(j-1)\lambda+\alpha}{j\mu}\geq 1$ for all $j$
$\Rightarrow$ the terms in the summation are non-decreasing
$\Rightarrow$ the sum $=\infty$
If $\lambda=\mu$ and $\alpha<\lambda=\mu$:
Raabe-Duhamel's test (not required content):
$$L:=\lim_{n\rightarrow\infty} n\left(\frac{a_n}{a_{n+1}}-1\right)\begin{cases}
>1&\text{converge}\\
<1&\text{diverge}\\
=1&\text{inconclusive}
\end{cases}$$
Here:
$$\begin{aligned}
L&=\lim_{n\rightarrow\infty}n\left(\frac{n\mu}{(n-1)\lambda+\alpha}-1\right)\\
&=\lim_{n\rightarrow\infty}n\cdot\frac{n\lambda-(n-1)\lambda-\alpha}{(n-1)\lambda+\alpha} \quad\leftarrow \mu=\lambda\\
&=\lim_{n\rightarrow\infty}n\cdot\frac{\lambda-\alpha}{(n-1)\lambda+\alpha}\\
&=\frac{\lambda-\alpha}{\lambda} <1
\end{aligned}$$
$\Rightarrow$ the sum $=\infty$
Conclusion
To sum up, the CTMC is positive recurrent if and only if $\lambda<\mu$
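A quick numerical illustration (with assumed values $\alpha=0.5$, not from the notes) of this criterion: the partial sums of $\prod_{j=1}^n\frac{(j-1)\lambda+\alpha}{j\mu}$ stabilize when $\lambda<\mu$ and blow up when $\lambda>\mu$.

```python
def partial_sum(lam, mu, alpha, N):
    """Partial sum of prod_{j=1}^n ((j-1)*lam + alpha)/(j*mu) up to n = N."""
    s, prod = 0.0, 1.0
    for j in range(1, N + 1):
        prod *= ((j - 1) * lam + alpha) / (j * mu)
        s += prod
    return s

alpha = 0.5
print(partial_sum(1.0, 2.0, alpha, 200))   # lam < mu: stabilizes (converges)
print(partial_sum(2.0, 1.0, alpha, 200))   # lam > mu: blows up (diverges)
```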
Q: What happens if $\lambda_0=0$? (0 is absorbing)
A: The chain is not irreducible; typically there are two classes:
$\{0\}$: positive recurrent
$\{1,2,\cdots\}$: transient
But the chain does not necessarily end up in state $0$, because it can also have $X(t)\rightarrow\infty$. Whether this is a possibility depends on the relation between $\{\lambda_i\}$ and $\{\mu_i\}$.